Why correlating Postfix logs is hard
If you’ve ever tracked a single email through Postfix by hand, you know the routine: grep for the queue ID, find scattered lines across multiple daemons, parse the timestamps, and piece together what happened. It seems straightforward. It is not.
A single Postfix message is not recorded as one coherent entry. Instead, it is smeared across multiple syslog lines from different daemons (smtpd, cleanup, qmgr, smtp, lmtp, bounce), tied together only by a queue ID. To reconstruct “what happened to this message”, you must stitch those lines by queue ID, handle per-recipient status correctly, account for rotation and compression, parse timestamps that vary by syslog format, and potentially read logs from other hosts. This is the gap between a grep result and a delivery record.
This is also why generic log tooling falls short, and why a mail-delivery-specific solution exists.
Messages are not atomic
In Postfix, a message passes through several daemons before and after delivery. Each daemon logs something, and each log entry is independent:
- smtpd (listening daemon) records the inbound SMTP session: client address, TLS version, encryption cipher (if STARTTLS).
- cleanup receives the message content and headers, assigns it a queue ID, logs the message ID.
- qmgr (queue manager) moves the message from the incoming queue to the active queue and logs the sender, size, and recipient count (nrcpt) once at this handoff.
- smtp or lmtp (delivery agents) attempt delivery to each recipient, producing one log line per recipient per delivery attempt.
- bounce (the bounce daemon) generates and logs bounce notifications if delivery fails.
None of these daemons have shared state. Each writes to syslog independently. The glue that binds all these lines to one message is a single field: the queue ID.
Consider this real sequence:
Jun 17 09:12:01 mx postfix/smtpd[1234]: 9F2A1C0A2B: client=relay.example.com[203.0.113.5]
Jun 17 09:12:01 mx postfix/cleanup[1240]: 9F2A1C0A2B: message-id=<20260617T0912.a1@example.com>
Jun 17 09:12:01 mx postfix/qmgr[900]: 9F2A1C0A2B: from=<sender@example.com>, size=4096, nrcpt=2 (queue active)
Jun 17 09:12:03 mx postfix/smtp[1250]: 9F2A1C0A2B: to=<a@acme.io>, relay=mx.acme.io[198.51.100.7]:25, delay=2.1, delays=0.2/0/0.5/1.4, dsn=2.0.0, status=sent (250 2.0.0 OK)
Jun 17 09:12:05 mx postfix/smtp[1251]: 9F2A1C0A2B: to=<b@slow.net>, relay=mx.slow.net[198.51.100.9]:25, delay=4.0, delays=0.1/0/0.3/3.6, dsn=4.2.2, status=deferred (host mx.slow.net said: 452 4.2.2 Mailbox full)
Jun 17 09:12:05 mx postfix/qmgr[900]: 9F2A1C0A2B: removed

All six lines describe a single message: queue ID 9F2A1C0A2B. The first three lines are metadata about the message itself. The fourth and fifth lines are per-recipient delivery attempts. The sixth line is the removal marker that qmgr logs when the message leaves the queue. (In a live log, removed is written only after every recipient has been delivered or bounced, so here it would normally follow a later retry for b@slow.net.) Without explicit knowledge that the queue ID ties them together, these lines are just scattered events in a log file.
This is the core problem: grep finds lines, not messages.
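As a rough illustration of the stitching step, the sketch below groups raw lines by queue ID. The regular expression and the `NOT_QUEUE_IDS` skip set are assumptions about the common `postfix/<daemon>[pid]: <queue-id>: ...` line shape, not an exhaustive grammar of what Postfix can log.

```python
import re
from collections import defaultdict

# Assumed line shape: "... postfix/<daemon>[<pid>]: <queue-id>: <rest>".
# Also tolerates instance or service prefixes such as postfix/submission/smtpd.
LINE_RE = re.compile(
    r"postfix[\w-]*/(?:[\w-]+/)*(?P<daemon>[\w-]+)\[\d+\]: "
    r"(?P<qid>[0-9A-Za-z]+): (?P<rest>.*)"
)

# Tokens that can appear in the queue-ID position but are not queue IDs
# (an illustrative list, not a complete one).
NOT_QUEUE_IDS = {"NOQUEUE", "warning", "error", "fatal", "panic"}


def group_by_queue_id(lines):
    """Return {queue_id: [(daemon, rest_of_line), ...]} in log order."""
    messages = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match["qid"] in NOT_QUEUE_IDS:
            continue  # session-level lines carry no queue ID
        messages[match["qid"]].append((match["daemon"], match["rest"].rstrip()))
    return dict(messages)


SAMPLE = [
    "Jun 17 09:12:01 mx postfix/smtpd[1234]: 9F2A1C0A2B: client=relay.example.com[203.0.113.5]",
    "Jun 17 09:12:01 mx postfix/cleanup[1240]: 9F2A1C0A2B: message-id=<20260617T0912.a1@example.com>",
    "Jun 17 09:12:01 mx postfix/qmgr[900]: 9F2A1C0A2B: from=<sender@example.com>, size=4096, nrcpt=2 (queue active)",
    "Jun 17 09:12:05 mx postfix/qmgr[900]: 9F2A1C0A2B: removed",
]

if __name__ == "__main__":
    for queue_id, events in group_by_queue_id(SAMPLE).items():
        print(queue_id, [daemon for daemon, _ in events])
```

Grouping is only the first step; the per-recipient lines inside each group still need their own interpretation.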
Per-recipient status means a message can have multiple outcomes
In the sequence above, the message has two recipients. One delivery succeeded (dsn=2.0.0, status=sent). One delivery deferred (dsn=4.2.2, status=deferred). The message as a whole is neither fully delivered nor fully failed; it is partially delivered and partially deferred.
This is not an edge case. It is the normal state of multirecipient mail. If an internal user sends to five external domains and one domain’s MX is down, the message sits in the queue (deferred for one recipient, sent for the others) until Postfix retries.
Generic dashboards that report “messages sent” vs. “messages deferred” must decide what to do with this message. Do you count it as sent? Deferred? Both? The answer depends on your use case. Postfix itself does not provide a single status for a multirecipient message; it provides per-recipient status via the status= field in the smtp/lmtp line.
To build an accurate delivery-health picture, you must:
- Parse every to=<addr> and status= pair in the log.
- Group them by queue ID.
- Compute a worst-case roll-up per message (bounced takes precedence over deferred, deferred takes precedence over sent) if you want a single metric.
- Or expose the per-recipient view so a human can see exactly which recipient was delivered, which deferred, which bounced.
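The steps above can be sketched as follows. This is a minimal illustration, assuming the `to=<...>, ..., status=...` layout shown in the sample lines; `recipient_outcomes`, `roll_up`, and the `SEVERITY` table are hypothetical names, and only the three statuses discussed here are ranked.

```python
import re
from collections import defaultdict

# Assumed shape of a delivery-agent line: "<queue-id>: to=<addr>, ..., status=<word> (...)".
RECIPIENT_RE = re.compile(
    r": (?P<qid>[0-9A-Za-z]+): to=<(?P<rcpt>[^>]*)>.*?\bstatus=(?P<status>\w+)"
)

# Worst-case ordering from the text: bounced > deferred > sent.
SEVERITY = {"sent": 0, "deferred": 1, "bounced": 2}


def recipient_outcomes(lines):
    """Return {queue_id: {recipient: latest status seen}}."""
    outcomes = defaultdict(dict)
    for line in lines:
        match = RECIPIENT_RE.search(line)
        if match:
            # A later attempt overwrites an earlier one for the same recipient.
            outcomes[match["qid"]][match["rcpt"]] = match["status"]
    return dict(outcomes)


def roll_up(per_recipient):
    """Collapse one message's recipients into a single worst-case status.

    Statuses missing from SEVERITY rank lowest in this sketch; a real
    roll-up would decide explicitly how to treat them.
    """
    return max(per_recipient.values(), key=lambda status: SEVERITY.get(status, -1))


LINES = [
    "Jun 17 09:12:03 mx postfix/smtp[1250]: 9F2A1C0A2B: to=<a@acme.io>, relay=mx.acme.io[198.51.100.7]:25, delay=2.1, delays=0.2/0/0.5/1.4, dsn=2.0.0, status=sent (250 2.0.0 OK)",
    "Jun 17 09:12:05 mx postfix/smtp[1251]: 9F2A1C0A2B: to=<b@slow.net>, relay=mx.slow.net[198.51.100.9]:25, delay=4.0, delays=0.1/0/0.3/3.6, dsn=4.2.2, status=deferred (host mx.slow.net said: 452 4.2.2 Mailbox full)",
]

if __name__ == "__main__":
    for queue_id, recipients in recipient_outcomes(LINES).items():
        print(queue_id, recipients, "->", roll_up(recipients))
```

Keeping the per-recipient dictionary alongside the roll-up preserves both views: the single metric for a dashboard and the detail a human needs when investigating.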
A simple grep | wc -l counts lines, not messages or delivery outcomes. It cannot answer “how many of my messages to domain X were sent successfully?” without human interpretation.
What different tools give you
| Capability | grep + manual | pflogsumm | Grafana + Loki | Postfix Insights |
|---|---|---|---|---|
| Find one message by queue ID | Yes, manual parse | No | Possible (text search) | Yes |
| Per-recipient delivery status | No | Aggregated | Aggregated | Yes |
| DSN breakdown (5.1.1, 4.2.2, etc.) | Manual count | Aggregated | Possible (LogQL parsing) | Yes |
| Search by recipient or domain | No | Summary only | Possible (text search) | Yes |
| Handle log rotation/gzip | Manual extraction | Logrotate aware | Logrotate aware | Automatic |
| Remote logs (SSHFS support) | Copy files manually | No | Via agent | Yes |
| Self-hosted, no external service | Yes | Yes | Yes | Yes |
| Real-time SLA and bounce rate | No | Manual scripts | Yes | Yes |
This table illustrates the tradeoff: grep requires significant manual effort per query. Log-aggregation stacks (Grafana with Loki) excel at time-series aggregation but are overkill for point lookups and lack mail-specific parsing. Purpose-built tools eliminate the correlation step.
Queue IDs have a variable length
Postfix normally uses short queue IDs (e.g., 9F2A1C0A2B): hexadecimal strings whose length is not fixed. The enable_long_queue_ids parameter, when set to yes, switches to longer IDs that mix digits with upper- and lowercase letters. An install may have both short and long IDs in the same log file if the parameter was changed while that log was being written.
Any parser that assumes a fixed queue ID length or character set will break. You must parse them as variable-length alphanumeric strings.
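One way to stay tolerant is to match the queue-ID position rather than a fixed width. The sketch below is an assumption-laden illustration: the skip list is not exhaustive, the short/long split is a heuristic (all uppercase hex means short), and `3Pt2mN2VXxznjll` is a made-up long-style ID used only for demonstration.

```python
import re

# The queue ID sits between "]: " and the next ": "; no length is assumed.
QUEUE_ID_FIELD = re.compile(r"\]: (?P<token>[0-9A-Za-z]+): ")

# Keywords that can occupy the same position without being queue IDs
# (illustrative, not complete).
NOT_QUEUE_IDS = frozenset({"NOQUEUE", "warning", "error", "fatal", "panic"})

# Heuristic: short IDs consist only of uppercase hexadecimal digits.
SHORT_ID = re.compile(r"[0-9A-F]+")


def extract_queue_id(line):
    """Return the queue ID on a log line, or None if the line has none."""
    match = QUEUE_ID_FIELD.search(line)
    if match is None or match["token"] in NOT_QUEUE_IDS:
        return None
    return match["token"]


def queue_id_kind(queue_id):
    """Label an already-extracted queue ID as 'short' or 'long'."""
    return "short" if SHORT_ID.fullmatch(queue_id) else "long"


if __name__ == "__main__":
    for line in (
        "Jun 17 09:12:05 mx postfix/qmgr[900]: 9F2A1C0A2B: removed",
        "Jun 17 09:12:05 mx postfix/qmgr[900]: 3Pt2mN2VXxznjll: removed",  # hypothetical long ID
        "Jun 17 09:12:05 mx postfix/smtpd[1234]: NOQUEUE: reject: RCPT from unknown",
    ):
        queue_id = extract_queue_id(line)
        print(queue_id, queue_id_kind(queue_id) if queue_id else "-")
```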
Status codes are not the SMTP reply code
The dsn= field in the smtp/lmtp log line contains an Enhanced Mail System Status Code, defined by RFC 3463. This is a three-part code: class.subject.detail, for example 2.0.0 (success), 4.2.2 (temporary failure, mailbox full), or 5.1.1 (permanent failure, bad destination mailbox address). These codes provide a structured, programmatic way to categorize delivery outcomes across different SMTP servers. When delivery fails, the bounce notification that carries the code uses the DSN message format defined in RFC 3464 (Delivery Status Notifications).
The SMTP reply code (e.g., 250, 452, 550, defined in RFC 5321) is logged separately in the message text part of the smtp/lmtp line (e.g., (250 2.0.0 OK)). The SMTP code is issued by the remote server and is subject to variation and abuse; the DSN code is what Postfix standardizes in the dsn= field.
If you want to categorize bounces by reason (e.g., “bad email address” vs. “mailbox full” vs. “domain does not exist”), the DSN code is your reliable source. The SMTP code is context and should not be parsed as a first-line signal.
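A small sketch of that categorization, assuming the `dsn=` and `status=... (...)` layout from the sample lines. The class and subject labels paraphrase RFC 3463; `classify_delivery` is a hypothetical helper, and the reply-code pattern is deliberately conservative, returning `None` when the parenthesized text does not start with a three-digit code.

```python
import re

DSN_RE = re.compile(r"\bdsn=(?P<cls>[245])\.(?P<subject>\d{1,3})\.(?P<detail>\d{1,3})\b")

# SMTP reply code as it appears in the free-text part, e.g. "(250 2.0.0 OK)"
# or "(host mx.example said: 452 ...)". Kept as context only.
REPLY_RE = re.compile(r"status=\w+ \((?:host \S+ said: )?(?P<reply>[2-5]\d{2})[ -]")

# First digit of the enhanced status code (RFC 3463 class).
CLASSES = {"2": "success", "4": "temporary failure", "5": "permanent failure"}

# Second field of the enhanced status code (RFC 3463 subject).
SUBJECTS = {
    "0": "other or undefined",
    "1": "addressing",
    "2": "mailbox",
    "3": "mail system",
    "4": "network and routing",
    "5": "mail delivery protocol",
    "6": "message content or media",
    "7": "security or policy",
}


def classify_delivery(line):
    """Return DSN-based category info for a delivery line, or None."""
    dsn = DSN_RE.search(line)
    if dsn is None:
        return None
    reply = REPLY_RE.search(line)
    return {
        "dsn": f"{dsn['cls']}.{dsn['subject']}.{dsn['detail']}",
        "class": CLASSES[dsn["cls"]],
        "subject": SUBJECTS.get(dsn["subject"], "unknown"),
        "smtp_reply": reply["reply"] if reply else None,
    }


if __name__ == "__main__":
    line = (
        "Jun 17 09:12:05 mx postfix/smtp[1251]: 9F2A1C0A2B: to=<b@slow.net>, "
        "relay=mx.slow.net[198.51.100.9]:25, delay=4.0, delays=0.1/0/0.3/3.6, "
        "dsn=4.2.2, status=deferred (host mx.slow.net said: 452 4.2.2 Mailbox full)"
    )
    print(classify_delivery(line))
```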
Timestamps can vary by syslog format
Postfix logs go to syslog, and syslog format is not monolithic. Classic BSD syslog format uses Jan 17 09:12:01 (month, day, time, no year or timezone). RFC 3339 format adds an explicit year, a timezone offset, and optional fractional seconds. On many Postfix deployments, logs pass through a syslog daemon (rsyslog, syslog-ng) that may rewrite timestamps or add timezone information. A log line forwarded over the network may carry a different timestamp than the one it was written with. In BSD-format logs that have been rotated and gzipped by logrotate, the year is not in the file at all and has to be inferred, typically from the file’s mtime.
If you correlate by queue ID within a single host’s mail.log, timestamps are less critical (you can use mtime and line order). But if you correlate logs from multiple daemons or multiple hosts, you must be prepared to parse multiple syslog dialects and handle timezone offsets.
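A minimal sketch of handling the two dialects named above. It assumes English month abbreviations, that the caller supplies the missing year and timezone for BSD-style lines (`assumed_year` and `assumed_tz` are hypothetical parameters), and that RFC 3339 timestamps use a numeric offset; a trailing `Z` is only accepted by `datetime.fromisoformat` on newer Python versions.

```python
from datetime import datetime, timezone


def parse_syslog_timestamp(line, assumed_year, assumed_tz=timezone.utc):
    """Return a timezone-aware datetime for a BSD or RFC 3339 syslog line."""
    first_token = line.split(" ", 1)[0]
    # RFC 3339 lines start with "YYYY-MM-DDTHH:MM:SS..." as one token.
    if len(first_token) >= 19 and first_token[4] == "-" and first_token[10] == "T":
        parsed = datetime.fromisoformat(first_token)
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=assumed_tz)
    # BSD lines start with a fixed-width 15-character stamp: "Jun 17 09:12:01".
    # The year is prepended because the log line does not contain one.
    parsed = datetime.strptime(f"{assumed_year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=assumed_tz)


if __name__ == "__main__":
    bsd = "Jun 17 09:12:01 mx postfix/qmgr[900]: 9F2A1C0A2B: removed"
    iso = "2026-06-17T09:12:01.250000+02:00 mx postfix/qmgr[900]: 9F2A1C0A2B: removed"
    print(parse_syslog_timestamp(bsd, assumed_year=2026))
    print(parse_syslog_timestamp(iso, assumed_year=2026))
```

Normalizing every line to an aware datetime up front means later comparisons across hosts do not depend on which dialect each host happened to emit.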
Logs rotate, compress, and are not queryable after rotation
Postfix logs go to /var/log/mail.log (or a site-specific path). On most systems, logrotate compresses old logs: mail.log becomes mail.log.1.gz, mail.log.2.gz, etc. A message with a delivery delay of several days may have log lines in three or four compressed files.
A typical grep approach requires you to extract all the gzipped files, concatenate them, and search. For remote logs (logs on another host), you must copy them over. For a busy mail server, the log directory can consume gigabytes of disk space.
Additionally, once a log is rotated and no longer the current mail.log, it is no longer updated. If you want to correlate a message with a delivery delay spanning the rotation boundary, you must query multiple files.
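Reading across the rotation boundary can be sketched like this, assuming logrotate's default numeric suffixes (`mail.log.1`, `mail.log.2.gz`, ...). Date-based names produced by the `dateext` option are not handled, and `rotated_files` and `iter_log_lines` are hypothetical helper names.

```python
import gzip
import re
from pathlib import Path

# Suffix after the base name: "" (current log), ".N", or ".N.gz".
SUFFIX_RE = re.compile(r"^(?:\.(\d+)(?:\.gz)?)?$")


def rotated_files(current_log):
    """Return the current log and its numbered rotations, oldest first."""
    current = Path(current_log)
    found = []
    for path in current.parent.glob(current.name + "*"):
        match = SUFFIX_RE.match(path.name[len(current.name):])
        if match:
            # The un-rotated file counts as rotation 0 (the newest).
            found.append((int(match.group(1) or 0), path))
    # Higher rotation numbers are older, so sort descending.
    return [path for _, path in sorted(found, reverse=True)]


def iter_log_lines(current_log):
    """Yield lines from all rotations in chronological file order."""
    for path in rotated_files(current_log):
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            yield from handle


if __name__ == "__main__":
    # The path is an example; adjust it to the site's log location.
    for path in rotated_files("/var/log/mail.log"):
        print(path)
```

Streaming the files in order avoids extracting them to disk, and the resulting line iterator can be fed to whatever grouping logic is used for the current log.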
TLS and DKIM logging is incomplete
Postfix logs the TLS handshake directly. The smtpd daemon logs the TLS version and cipher suite for the inbound SMTP session. The smtp/lmtp daemon logs the TLS version and cipher suite for outbound deliveries (RFC 3207 for STARTTLS, RFC 5246 for TLS 1.2, RFC 8446 for TLS 1.3). This information is scattered across smtpd lines (for inbound) and smtp lines (for outbound). RFC 8314 recommends implicit TLS over STARTTLS for mail submission and access; it does not address server-to-server relay.
Postfix itself does not sign or verify DKIM signatures. DKIM (RFC 6376) signing and verification are handled by a milter (mail filter) such as OpenDKIM. The DKIM signing result surfaces in milter-level logs and the Authentication-Results header, not in Postfix core logs. To build a “DKIM signing rate” metric, you would need to parse the DKIM-Signature header in delivered messages or correlate the log with the milter logs. To read the Authentication-Results and DKIM-Signature headers on an individual message, paste them into the Email Header Analyzer.
How Postfix Insights handles this
Postfix Insights correlates messages by queue ID, groups all log lines for that ID, and exposes per-recipient status so you can see exactly which recipient was sent, deferred, or bounced. It handles log rotation and gzip transparently, reads logs locally or over SSHFS, and parses multiple syslog formats. The search surface lets you find messages by recipient, domain, or subject, with the raw log lines and structured delivery outcome visible. The delivery-health dashboard aggregates per-recipient outcomes to compute volume, bounce rate, defer rate, SLA, and DSN breakdown.
To get started, see the quick-start guide. To explore the project or contribute, visit the GitHub repository.
References
- Postfix documentation: https://www.postfix.org/documentation.html
- Postfix enable_long_queue_ids parameter: https://www.postfix.org/postconf.5.html#enable_long_queue_ids
- RFC 3207 (SMTP Service Extension for Secure SMTP over Transport Layer Security): https://www.rfc-editor.org/rfc/rfc3207
- RFC 3339 (Date and Time on the Internet: Timestamps): https://www.rfc-editor.org/rfc/rfc3339
- RFC 3463 (Enhanced Mail System Status Codes): https://www.rfc-editor.org/rfc/rfc3463
- RFC 3464 (Delivery Status Notifications): https://www.rfc-editor.org/rfc/rfc3464
- RFC 5246 (The TLS Protocol Version 1.2): https://www.rfc-editor.org/rfc/rfc5246
- RFC 5321 (Simple Mail Transfer Protocol): https://www.rfc-editor.org/rfc/rfc5321
- RFC 6376 (DKIM Signatures): https://www.rfc-editor.org/rfc/rfc6376
- RFC 8314 (Cleartext Considered Obsolete: Use of TLS for Email Submission and Access): https://www.rfc-editor.org/rfc/rfc8314
- RFC 8446 (The TLS Protocol Version 1.3): https://www.rfc-editor.org/rfc/rfc8446
- OpenDKIM: http://www.opendkim.org/