update: dbmail supports unicode by storing messageblks as bytea fields in PG since 2.1.6 (Paul)

As of 2.0.7 (and it looks like 2.1.1 also), dbmail uses PostgreSQL's “text” type for the “messageblk” column of the “dbmail_messageblks” table. This presents a problem when the PostgreSQL database uses the UNICODE encoding, and the user attempts to store non-ASCII, non-UTF-8 message, e.g. one encoded in KOI8-R:

Message-ID: <429E8935.50303@my-domain.edu>
Date: Wed, 01 Jun 2005 21:21:09 -0700
From: Joe Blow <me@somewhere.edu>
MIME-Version: 1.0
To:  jane@nowhere.edu
Subject: test KOI8-R message
Content-Type: text/plain; charset=KOI8-R; format=flowed
Content-Transfer-Encoding: 8bit

This is a KOI8-R message.  Pretend it has octets above 127.

Since the “messageblk” attribute is type “text”, PostgreSQL will try to interpret this byte stream as UTF-8 text and complain of invalid UTF-8 sequences in the KOI8-R data. Note that this does not typically happen in a compound MIME message because non-7bit parts are typically base64 or quoted-printable encoded — at least this is what Thunderbird seems to do.

All of this works out when the PostgreSQL database is ASCII (which PostgreSQL silently treats as ISO 8859-1). Even if the message is in some other encoding (single byte or multi-byte — doesn't matter), PostgreSQL will just assume it's ISO 8859-1 and happily store and retrieve it. But unfortunately the encoding is database-wide, so any other tables in the same database (e.g. non-dbmail tables) must also be ASCII. This would be restrictive, if, e.g. one wanted to integrate an addressbook into the same database (probably in a separate schema) to gain referential integrity, etc.

I've managed to solve this by changing the messageblk column from type “text” to type “bytea”, thus making message blocks opaque binary blocks. This is fine since nothing in dbmail depends on that column being textual (e.g. dbmail would never need to do a locale-sensitive “order by” on that column). Basically, everywhere there is a query that returns the “messageblk” column, it gets changed to encode(messageblk, 'escape'), and db_escape_direct() in dbpgsql.c needs to use PQescapeBytea() instead of PQescapeString() (these two functions strangely have different call semantics).

Two different patches for this have been posted to the developers' mailing list (see links below).

See also: bug 218

This has been discussed at least three times on the developers' mailing list:

college essay

unicode_postgresql_database.txt · Last modified: 2011/08/04 10:12 by jardenarobinson
DBMail is developed by Paul J Stevens together with developers world-wide