Indexed body searches

author: Paul J Stevens

As of dbmail-2.2, IMAP searches are fast and efficient because they are purely database driven. The only exception is search actions that require access to the body of mail messages. This design seeks to address that exception.

Although the IMAP RFC requires that only text/plain message parts be searched, being able to search the contents of other content types will add to dbmail's value. To achieve this, we propose adding a separate table that will contain the words from each attachment.

The kind of query used to search this new table will depend partly on the database engine in use. As the engines develop further, these queries can be adapted to achieve better performance and reliability. With MySQL/MyISAM tables and with PostgreSQL/tsearch we'll be able to use full-text indexing and searching. As a lowest common denominator, we'll always be able to use LIKE queries. Even though LIKE queries are relatively slow compared to full-text searching, they will still be a major improvement over the current setup.
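The per-engine choice described above could be sketched roughly as follows. This is only an illustration: the table `words_index`, its columns `part_id` and `content`, the function name `build_body_query`, and the `%s` placeholder style (as used by common MySQL and PostgreSQL drivers for Python) are all assumptions, not part of dbmail. The PostgreSQL branch assumes the built-in text search functions rather than the older tsearch contrib module.

```python
def build_body_query(engine, term):
    """Return (sql, params) for a body search on the given backend.

    Table and column names are hypothetical, chosen for illustration only.
    """
    if engine == "mysql":
        # Assumes a FULLTEXT index on words_index.content (MyISAM).
        sql = ("SELECT part_id FROM words_index "
               "WHERE MATCH(content) AGAINST (%s)")
        return sql, (term,)
    if engine == "postgresql":
        # Assumes PostgreSQL's built-in text search functions.
        sql = ("SELECT part_id FROM words_index "
               "WHERE to_tsvector('simple', content) "
               "@@ plainto_tsquery('simple', %s)")
        return sql, (term,)
    # Lowest common denominator: substring match with LIKE.
    # '!' is used as the escape character so that the user's own
    # '%' and '_' are matched literally on any backend.
    escaped = (term.replace("!", "!!")
                   .replace("%", "!%")
                   .replace("_", "!_"))
    sql = ("SELECT part_id FROM words_index "
           "WHERE content LIKE %s ESCAPE '!'")
    return sql, ("%" + escaped + "%",)
```

Keeping the engine-specific SQL behind one function means the queries can be swapped out as the engines improve, without touching the callers.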

Construction and maintenance of the words table will require access to external utilities for parsing and converting non-text attachments. Converters for MSWord documents, Excel spreadsheets, PDF files, etc. are readily available. However, since this parsing and converting will be relatively slow and resource intensive, we propose to offload it to a separate daemon so that it may run on a separate server. This daemon will have access to the dbmail database and to the converters, and will listen on a network port for requests to (re)index certain messages. It will also be possible to request a revisit of all attachments of a certain type, e.g. after a converter has been upgraded.
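To make the daemon's request interface more concrete, here is a minimal sketch of a line-based protocol. The design above does not define a wire format, so the commands `REINDEX MESSAGE <id>` and `REINDEX TYPE <mime-type>`, the names `parse_request`, `IndexRequestHandler` and `make_server`, and the in-memory job list are all hypothetical. The actual conversion and database work is left out.

```python
import socketserver


def parse_request(line):
    """Parse one request line into (kind, argument).

    Hypothetical protocol:
        REINDEX MESSAGE <id>        reindex a single message
        REINDEX TYPE <mime-type>    revisit all attachments of one type
    Raises ValueError on anything else.
    """
    parts = line.strip().split()
    if len(parts) != 3 or parts[0].upper() != "REINDEX":
        raise ValueError("expected REINDEX MESSAGE <id> or REINDEX TYPE <mime>")
    kind, arg = parts[1].upper(), parts[2]
    if kind == "MESSAGE":
        return ("message", int(arg))  # int() raises ValueError on bad ids
    if kind == "TYPE":
        return ("type", arg.lower())
    raise ValueError("unknown target: " + parts[1])


class IndexRequestHandler(socketserver.StreamRequestHandler):
    """Reads request lines and queues them for a worker to process."""

    def handle(self):
        for raw in self.rfile:
            try:
                job = parse_request(raw.decode("utf-8", "replace"))
            except ValueError as exc:
                self.wfile.write(("ERR %s\n" % exc).encode("utf-8"))
                continue
            self.server.jobs.append(job)
            self.wfile.write(b"OK\n")


def make_server(host="127.0.0.1", port=0):
    """Create (but do not start) the TCP server; port 0 picks a free port."""
    server = socketserver.TCPServer((host, port), IndexRequestHandler)
    server.jobs = []  # stand-in for a real work queue
    return server
```

A client would send, for example, `REINDEX TYPE application/pdf` after upgrading the PDF converter, and the daemon would then work through the queued jobs at its own pace.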

NOTE: The IMAP RFC requires that body searches also match substrings (e.g. the search key “ello” will find “hello”). The above design therefore cannot do indexed searches for text/plain parts without violating the IMAP spec. However, a non-indexed LIKE %ello% lookup in the words table could still be faster than running it against all message bodies.
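The difference between a word lookup and the substring match that IMAP requires can be illustrated with a small in-memory example. SQLite is used here only as a stand-in so the snippet is self-contained; the `words` table layout and the function names are assumptions.

```python
import sqlite3

# Hypothetical words table: one row per (message part, word).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (part_id INTEGER, word TEXT)")
conn.execute("CREATE INDEX words_word_idx ON words (word)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [(1, "hello"), (1, "world"), (2, "yellow")],
)


def exact_match(key):
    """Whole-word lookup: can use the index, but misses substrings."""
    rows = conn.execute(
        "SELECT DISTINCT part_id FROM words WHERE word = ? ORDER BY part_id",
        (key,),
    )
    return [r[0] for r in rows]


def substring_match(key):
    """IMAP-style substring lookup; key is assumed to contain no wildcards."""
    rows = conn.execute(
        "SELECT DISTINCT part_id FROM words WHERE word LIKE ? ORDER BY part_id",
        ("%" + key + "%",),
    )
    return [r[0] for r in rows]


print(exact_match("ello"))      # [] : no stored word equals "ello"
print(substring_match("ello"))  # [1, 2] : matches "hello" and "yellow"
```

The leading wildcard in `%ello%` means the index on `word` generally cannot be used, which is why this lookup is described as non-indexed.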

Proposal: What about a configure-time option called “use fulltext search as default”? When this is enabled, the search violates the IMAP spec but runs fast via full-text search. When the user places a special character (% or ~) in the search key, it switches to the LIKE search. Or vice versa: LIKE is the default, and a ~ at the beginning of the key switches to full-text search.
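A minimal sketch of how this switch might work is given below. The proposal leaves the details open, so this assumes the special character is a prefix of the search key that is stripped before searching. The function name `choose_search_mode` and the `fulltext_default` parameter are hypothetical.

```python
def choose_search_mode(key, fulltext_default=True):
    """Return (mode, cleaned_key), where mode is "fulltext" or "like".

    Assumption: the marker character is the first character of the key
    and is not part of the text to search for.
    """
    if fulltext_default:
        # Full-text by default; a leading % or ~ requests a LIKE search.
        if key[:1] in ("%", "~"):
            return ("like", key[1:])
        return ("fulltext", key)
    # The "vice versa" variant: LIKE by default, ~ requests full-text.
    if key.startswith("~"):
        return ("fulltext", key[1:])
    return ("like", key)


print(choose_search_mode("hello"))                          # ('fulltext', 'hello')
print(choose_search_mode("%ello"))                          # ('like', 'ello')
print(choose_search_mode("~hello", fulltext_default=False)) # ('fulltext', 'hello')
```

One consequence of either variant is that a key that legitimately starts with the marker character cannot be searched for literally without some additional escaping rule.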

bodysearch.txt · Last modified: 2012/02/27 21:45 by bas