Structured log files


Historically, logs on Unix systems have been line-based: each event is represented by a single line appended to a file, with the fields of the event separated by spaces. These logs are compact, human readable, somewhat machine readable, and easily analyzed with pipelines of command line tools such as cut, sort, uniq, awk, and sed.

Jun 30 21:15:57 personal rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="962" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jun 30 21:16:10 personal anacron[1495]: Job `cron.daily' terminated
Jun 30 21:16:10 personal anacron[1495]: Normal exit (1 job run)
Jun 30 21:16:12 personal tpvmlpd2[4066]: readlink("/dev/vmwcomgw") failed with 2
Jun 30 21:16:27 personal tpvmlpd2[4110]: readlink("/dev/vmwcomgw") failed with 2
Jun 30 21:16:27 personal tpvmlpd2[3557]: aborting
Jun 30 21:17:01 personal CRON[4113]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 30 21:18:59 personal NetworkManager[1325]: <info> kernel firmware directory '/lib/firmware' changed
Jun 30 21:19:32 personal kernel: [ 539.039374] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled

These days, log indexing tools like Logstash and Splunk store each event as a record instead of a string. Instead of assuming that every entry has the same fields in the same order -- "DATE HOSTNAME SOURCE: MESSAGE" -- entries can contain any number of fields, and each field is explicitly named, like:

DATE: Jun 30 21:19:32
SOURCE: NetworkManager
PROCESS_ID: 1325
LEVEL: INFO
MESSAGE: kernel firmware directory '/lib/firmware' changed

This representation makes it easier to extend the record format while keeping backward compatibility with older events. It allows queries to be expressed more naturally -- for example, searching for events where SOURCE=NetworkManager. And it lets you combine logs from multiple sources, as long as all of the sources agree on the field names. However, this approach often sacrifices the human readability of the log file itself and the ability to analyze it with shell pipelines.
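To make the field-based query idea concrete, here is a minimal sketch in Python. It assumes a hypothetical JSON-lines log -- one record per line, with the field names from the example above -- and filters by SOURCE without caring where that field sits on the line:

```python
import json

# Hypothetical JSON-lines log: one record per line, explicit field names.
log_lines = [
    '{"DATE": "Jun 30 21:18:59", "SOURCE": "NetworkManager", '
    '"PROCESS_ID": 1325, "LEVEL": "INFO", '
    '"MESSAGE": "kernel firmware directory changed"}',
    '{"DATE": "Jun 30 21:17:01", "SOURCE": "CRON", '
    '"PROCESS_ID": 4113, "LEVEL": "INFO", '
    '"MESSAGE": "(root) CMD ( cd / && run-parts --report /etc/cron.hourly)"}',
]

# Field-based filtering: query by name, not by position within the line.
events = [json.loads(line) for line in log_lines]
matches = [e for e in events if e["SOURCE"] == "NetworkManager"]
print(len(matches))  # 1
```

An indexer does the same thing at scale, but the principle is identical: records with named fields can be filtered without any knowledge of the original line layout.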

It would be nice to get all of the benefits of both -- the human readability of text files combined with the flexibility of indexing tools.

One approach is to parse classical line-based logs and feed the results into a record-based indexer, but the parser must be programmed with the exact order of the fields within each line, which makes it difficult to extend or change the log file format later.
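A minimal sketch of such a parser, written against the sample format shown above ("MONTH DAY TIME HOST SOURCE[PID]: MESSAGE"), shows the problem: the field order is baked into the regular expression, so any change to the line format breaks the parser.

```python
import re

# Field order is hard-coded into the pattern -- this is the fragility
# described above.
LINE_RE = re.compile(
    r"(?P<date>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<source>[^\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)"
)

line = "Jun 30 21:16:10 personal anacron[1495]: Job `cron.daily' terminated"
record = LINE_RE.match(line).groupdict()
print(record["source"], record["pid"])  # anacron 1495
```

Real syslog parsers must also cope with lines that deviate from this shape (kernel messages, missing PIDs, multi-word sources), which is where the approach gets painful.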

Another common approach is to write each event to disk as JSON. Although this is more verbose, it is still (barely) human readable and easily parsed into a record. In another post, I will explore using python-json-logger to output JSON logs in Python.
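As a preview of the idea, here is a stdlib-only sketch of JSON logging -- a hand-rolled formatter for Python's logging module. The field names are my own choice to match the example record above; python-json-logger packages this same idea as a reusable formatter.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "LEVEL": record.levelname,
            "SOURCE": record.name,
            "MESSAGE": record.getMessage(),
        })

# Log to an in-memory stream for demonstration; a real setup would use
# a FileHandler.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("NetworkManager")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("kernel firmware directory '/lib/firmware' changed")
print(stream.getvalue().strip())
```

Each line of the resulting file is both appendable like a classic log and directly loadable as a record by an indexer.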

Subscribe to All Posts - Wesley Tanaka