High availability - monitoring services

All software has bugs and DBMail is unexceptional. Ensuring healthy services requires monitoring and remedial action when a service goes outside defined boundaries.

There are many network monitoring tools, this article uses Nagios but the principles are straightforward and easily converted between systems.

Although it's essential to have a service running at all times, beware of thundering herds. Email is a disconnected service and most users won't notice if it's down for a few minutes. With careful monitoring it's possible to have downtime of less than a second.

Availability

All services should be available, if they're not then there's an issue that needs resolving.

For example checking a healthy system provides a positive response:

/usr/local/libexec/nagios/check_imap --hostname=example.com
IMAP OK - 0.001 second response time on example.com port 143 [* OK [CAPABILITY IMAP4rev1 AUTH=LOGIN STARTTLS ID LITERAL+] dbmail 3.5.1 ready.]|time=0.00
0914s;;;0.000000;10.000000

If there is an issue with the service then the response depends on a few heuristics, Nagios has user configured levels as Warning and Critical, plus Soft and Hard modifiers.

A warning might configured to be triggered with a delay of three seconds whereas critical might be triggered if the delay is longer than ten seconds.

Understanding your usage patterns is essential to getting the balance right. Too aggressive and you might compound a busy or overloaded system.

If it's a recent "soft" error, then restarting the daemon can be an easy fix.

If the error is persistent aka critical, then restarting the daemon may have failed, in this case try again with a timeout then force stop it then restart with:

timeout 3 service dbmail-imapd stop
killall -v -KILL dbmail-imapd
sleep 1
service dbmail-imapd start

Memory usage

DBMail has fixed all of the easy memory leaks, as a few remain it's prudent to monitor memory usage and restart the service when memory usage goes above a certain level.

It's not uncommon to see memory usage in hundreds of megabytes, but excessive memory usage can be unhelpful. Current practice suggests 1Gb is a useful target to restart a service.

Here's a healthy check:

/usr/local/libexec/nagios/check_procs -w 1024000 -c 1025000 --metric=VSZ --command=dbmail-imapd
VSZ OK: 1 process with command name 'dbmail-imapd' | procs=1;;;0; procs_warn=0;;;0; procs_crit=0;;;0; procvsz=308584;

Database response

DBMail is designed to be fast. Adding rarely used indexes slows a system down so the schema is designed to be good for most use cases. Although most queries are sub second, occasionally they are longer; the defaults can be changed:

query_time_info       = 10
query_time_notice     = 20
query_time_warning    = 30

On a heavily used system it's not uncommon to see delays of a few seconds, the following is one of the longest at twenty five seconds and whilst unusual, it's very rarely seen and so there's no need to optimise it:

dbmail/imap4d[92064] Warning:[db] log_query_time(+387): slow query [
SELECT
m.message_idnr, n.headername, v.headervalue,
CASE
WHEN n.headername='date' THEN 0
WHEN n.headername='subject' THEN 1
WHEN n.headername='from' THEN 2
WHEN n.headername='content-type' THEN 3
WHEN n.headername='to' THEN 4
WHEN n.headername='cc' THEN 5
WHEN n.headername='bcc' THEN 6
WHEN n.headername='reply-to' THEN 7
WHEN n.headername='message-id' THEN 8
WHEN n.headername='references' THEN 9
WHEN n.headername='in-reply-to' THEN 10
WHEN n.headername='list-post' THEN 11
WHEN n.headername='list-unsubscribe' THEN 12
WHEN n.headername='sender' THEN 13
WHEN n.headername='x-k9mail-identity' THEN 14
WHEN n.headername='chat-version' THEN 15
WHEN TRUE THEN 1
ELSE 16
END AS seq
FROM dbmail_header h
LEFT JOIN dbmail_messages m ON h.physmessage_id=m.physmessage_id
LEFT JOIN dbmail_headername n ON h.headername_id=n.id
LEFT JOIN dbmail_headervalue v ON h.headervalue_id=v.id
WHERE h.headername_id  IN (0,2,12,5,1,13,25,5055,4983,4969,4972,5054,5021,4993,5023)
AND m.mailbox_idnr = 125
AND m.message_idnr BETWEEN 5386865 AND 5647845
AND status < 2
GROUP By m.message_idnr, n.headername, v.headervalue
ORDER BY m.message_idnr, seq
] took [25] seconds

Using postgres explain confirms the query is okay:

explain (analyze, buffers) SELECT
m.message_idnr, n.headername, v.headervalue,
CASE
WHEN n.headername='date' THEN 0
WHEN n.headername='subject' THEN 1
WHEN n.headername='from' THEN 2
WHEN n.headername='content-type' THEN 3
WHEN n.headername='to' THEN 4
WHEN n.headername='cc' THEN 5
WHEN n.headername='bcc' THEN 6
WHEN n.headername='reply-to' THEN 7
WHEN n.headername='message-id' THEN 8
WHEN n.headername='references' THEN 9
WHEN n.headername='in-reply-to' THEN 10
WHEN n.headername='list-post' THEN 11
WHEN n.headername='list-unsubscribe' THEN 12
WHEN n.headername='sender' THEN 13
WHEN n.headername='x-k9mail-identity' THEN 14
WHEN n.headername='chat-version' THEN 15
WHEN TRUE THEN 1
ELSE 16
END AS seq
FROM dbmail_header h
LEFT JOIN dbmail_messages m ON h.physmessage_id=m.physmessage_id
LEFT JOIN dbmail_headername n ON h.headername_id=n.id
LEFT JOIN dbmail_headervalue v ON h.headervalue_id=v.id
WHERE h.headername_id IN (0,2,12,5,1,13,25,5055,4983,4969,4972,5054,5021,4993,5023)
AND m.mailbox_idnr = 125
AND m.message_idnr BETWEEN 5386865 AND 5647845
AND status < 2
GROUP By m.message_idnr, n.headername, v.headervalue
ORDER BY m.message_idnr, seq;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Incremental Sort  (cost=38755.87..46729.76 rows=43163 width=187) (actual time=32682.260..32725.517 rows=14404 loops=1)
   Sort Key: m.message_idnr, (CASE WHEN ((n.headername)::text = 'date'::text) THEN 0 WHEN ((n.headername)::text = 'subject'::text) THEN 1 WHEN ((n.headernam
e)::text = 'from'::text) THEN 2 WHEN ((n.headername)::text = 'content-type'::text) THEN 3 WHEN ((n.headername)::text = 'to'::text) THEN 4 WHEN ((n.headernam
e)::text = 'cc'::text) THEN 5 WHEN ((n.headername)::text = 'bcc'::text) THEN 6 WHEN ((n.headername)::text = 'reply-to'::text) THEN 7 WHEN ((n.headername)::t
ext = 'message-id'::text) THEN 8 WHEN ((n.headername)::text = 'references'::text) THEN 9 WHEN ((n.headername)::text = 'in-reply-to'::text) THEN 10 WHEN ((n.
headername)::text = 'list-post'::text) THEN 11 WHEN ((n.headername)::text = 'list-unsubscribe'::text) THEN 12 WHEN ((n.headername)::text = 'sender'::text) T
HEN 13 WHEN ((n.headername)::text = 'x-k9mail-identity'::text) THEN 14 WHEN ((n.headername)::text = 'chat-version'::text) THEN 15 ELSE 1 END) 
 Presorted Key: m.message_idn
 Full-sort Groups: 402  Sort Method: quicksort  Average Memory: 29kB  Peak Memory: 29kB
 Buffers: shared hit=61357 read=5400
 ->  Group  (cost=38755.20..45083.18 rows=43163 width=187) (actual time=32682.092..32709.455 rows=14404 loops=1)
       Group Key: m.message_idnr, n.headername, v.headervalue
       Buffers: shared hit=61357 read=5400
         ->  Gather Merge  (cost=38755.20..43086.88 rows=35970 width=183) (actual time=32682.078..32694.820 rows=14404 loops=1) 
             Workers Planned: 2
             Workers Launched: 2
             Buffers: shared hit=61357 read=5400
             ->  Group  (cost=37755.18..37935.03 rows=17985 width=183) (actual time=32647.786..32652.051 rows=4801 loops=3)
                   Group Key: m.message_idnr, n.headername, v.headervalue
                   Buffers: shared hit=61357 read=5400
                   ->  Sort  (cost=37755.18..37800.14 rows=17985 width=183) (actual time=32647.783..32648.770 rows=4801 loops=3)
                         Sort Key: m.message_idnr, n.headername, v.headervalue
                         Sort Method: quicksort  Memory: 803kB
                         Buffers: shared hit=61357 read=5400
                         Worker 0:  Sort Method: quicksort  Memory: 806kB
                         Worker 1:  Sort Method: quicksort  Memory: 769kB
                         ->  Nested Loop Left Join  (cost=563.56..36484.13 rows=17985 width=183) (actual time=180.734..32638.164 rows=4801 loops=3)
                               Buffers: shared hit=61325 read=5400
                               ->  Hash Left Join  (cost=563.13..26236.31 rows=17985 width=33) (actual time=180.667..5645.710 rows=4801 loops=3)
                                     Hash Cond: (h.headername_id = n.id)
                                     Buffers: shared hit=7597 read=1510
                                     ->  Nested Loop  (cost=559.86..26183.85 rows=17985 width=24) (actual time=180.367..5640.572 rows=4801 loops=3)
                                           Buffers: shared hit=7540 read=1510
                                             ->  Parallel Bitmap Heap Scan on dbmail_messages m  (cost=559.39..8439.14 rows=4488 width=16) (actual time=0.75
7..10.471 rows=533 loops=3)
                                                 Recheck Cond: (mailbox_idnr = 14)
                                                 Filter: ((message_idnr >= 5386865) AND (message_idnr <= 5647845) AND (status < 2))
                                                 Rows Removed by Filter: 7603
                                                 Heap Blocks: exact=1848
                                                 Buffers: shared hit=4115
                                                   ->  Bitmap Index Scan on dbmail_messages_1  (cost=0.00..556.70 rows=38970 width=0) (actual time=1.484..1.
484 rows=24425 loops=1)
                                                       Index Cond: (mailbox_idnr = 14)
                                                       Buffers: shared hit=61
                                             ->  Index Only Scan using dbmail_header_pkey on dbmail_header h  (cost=0.47..3.86 rows=9 width=24) (actual time
=10.266..10.551 rows=9 loops=1600)
                                                 Index Cond: (physmessage_id = m.physmessage_id)
                                                 Filter: (headername_id = ANY ('{0,2,12,5,1,13,25,5055,4983,4969,4972,5054,5021,4993,5023}'::bigint[]))
                                                 Rows Removed by Filter: 15
                                                 Heap Fetches: 356
                                                 Buffers: shared hit=3425 read=1510
                                     ->  Hash  (cost=2.01..2.01 rows=101 width=25) (actual time=0.135..0.136 rows=101 loops=3)
                                           Buckets: 1024  Batches: 1  Memory Usage: 14kB
                                           Buffers: shared hit=3
                                             ->  Seq Scan on dbmail_headername n  (cost=0.00..2.01 rows=101 width=25) (actual time=0.035..0.071 rows=101 loo
ps=3)
                                                 Buffers: shared hit=3
                                 ->  Index Scan using dbmail_headervalue_pkey on dbmail_headervalue v  (cost=0.43..0.57 rows=1 width=166) (actual time=5.620
..5.620 rows=1 loops=14404)
                                     Index Cond: (id = h.headervalue_id)
                                     Buffers: shared hit=53728 read=3890
Planning:
 Buffers: shared hit=57
Planning Time: 0.787 ms
Execution Time: 49.424 ms
(55 rows)