I have written about system logging once or twice already. Earlier this month, I contrasted Windows and Unix event logs. A couple of posts ago, I mentioned logging as a “best practice” that’s frequently done poorly. In this post, I describe what I believe are currently accepted best practices for log collection, retention and analysis.
Log collection is not trivial. Done properly, the process should be:
- Useful – you should capture the events that matter and avoid logging unnecessary information
- Accurate – you should never lose log data to network failures or congestion
- Secure – logs should be protected from malicious modification; log-related network traffic should be encrypted
- Automatic – manual operation will always be a weak link in any system
Log collection systems typically involve two or more levels of log collection and storage. First, each endpoint needs to perform useful, accurate logging. On Windows systems, set your audit ACLs carefully to avoid generating unnecessary log events. Ensure that you have sufficient disk space for logging and that your event logs are configured to be large enough never to overwrite themselves.

On UNIX and its kin, you need to look at your syslog configuration and see what you can do with it. All versions of syslog let you send different log “facilities” and different priority levels to different locations. You should configure syslog to direct the kern and auth facilities to a different location than the noisier, less important facilities (mail, lpr, news, etc.). You’ll have to consider how important user, daemon and the others are to you. For some servers, especially those running mission-critical applications that employ the user facility, you may want to consider that facility important as well and direct it to the same location as kern and auth.

Next, you need to consider the priority levels. Usually, anything with a priority lower than warning does not need to be retained. The objective with syslog is to identify the information that you want to retain and separate it from that which you don’t. I usually end up with three log files: one for important, retained information (kern and auth output at warning and above), one for less important information from kern and auth, and one for all other information. The latter two categories aren’t retained centrally; they’re logged locally and managed with a log rotation mechanism. On most systems, log rotation can be performed by some versions of syslog or with a separate logrotate package.
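To make the three-file split concrete, here is a minimal /etc/syslog.conf sketch. It assumes the extended selector syntax of Linux’s sysklogd (the `!` exclusions); the file names are placeholders of my own, not a standard layout.

```
# Retained centrally: kern and auth messages at warning and above
kern.warning;auth.warning                      /var/log/retained.log

# Less important kern/auth output (below warning), local only
kern.*;auth.*;kern.!warning;auth.!warning      /var/log/kern-auth-info.log

# Everything else (info and above from the remaining facilities), local only
*.info;kern.none;auth.none                     /var/log/other.log
```

The two local files can then be handed to logrotate. A minimal stanza, again with illustrative names and retention settings:

```
/var/log/kern-auth-info.log /var/log/other.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        # Signal syslogd to reopen its log files after rotation
        /bin/kill -HUP `cat /var/run/syslogd.pid 2>/dev/null` 2>/dev/null || true
    endscript
}
```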
The endpoint is the first level of log collection. The next levels typically take endpoint data and put it in a centralized database. In large installations, this may involve intermediate servers taking data from sets of endpoints and then writing the data to a single centralized database cluster.
There are various products that support centralized log collection. NetPro’s LogAdmin product offers both Windows and UNIX support. You can also use Microsoft Operations Manager (MOM). For pure UNIX installations, some companies choose to use syslog’s server functionality: syslog, in addition to managing local logs, can act as a server, gathering log data from multiple machines. This approach, unfortunately, suffers from several limitations. Standard syslog supports only unencrypted, UDP-based communication, a transport that is subject both to data loss and to eavesdropping and tampering. Other versions of syslog, such as syslog-ng, support TCP, which, coupled with stunnel, can provide encrypted, reliable communication.
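To give a feel for the syslog-ng route, here is a client-side configuration sketch. It assumes syslog-ng 3.x syntax; the host name and port are placeholders, and with stunnel in the picture the TCP destination would point at a local stunnel listener rather than directly at the collector.

```
# Client-side syslog-ng sketch: forward local messages to a central host.
source s_local {
    system();      # the platform's local log sources (/dev/log, kernel, ...)
    internal();    # syslog-ng's own status messages
};

destination d_central {
    tcp("loghost.example.com" port(514));   # plain TCP; wrap with stunnel
};

log { source(s_local); destination(d_central); };
```

On the collector, a matching tcp() source listens for these connections and feeds whatever storage you have chosen.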
Regardless of what collection scheme you use, you should also implement good retention policies. Log data tends to accumulate in very large amounts, so understanding what you do and don’t need to have handy is important. Best practices suggest that you maintain the last 12 months of data in readily accessible form, where “readily accessible” usually translates to “fully searchable, uncompressed data”. Information older than 12 months should still be stored, but it can be compressed to minimize disk space. In practice, what this means is that log retention systems typically store 12 months of data in a relational database, then compress older data and remove it from the database.
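The expire-and-compress step is often just a scheduled job. A purely illustrative sketch, assuming MySQL and a hypothetical events table with a logged_at timestamp column:

```
#!/bin/sh
# Archive log rows older than 12 months, then purge them from the database.
# Database, table, and column names here are illustrative assumptions.
mysqldump --no-create-info \
    --where="logged_at < NOW() - INTERVAL 12 MONTH" \
    logdb events \
    | gzip > /archive/events-$(date +%Y%m).sql.gz
mysql logdb -e "DELETE FROM events WHERE logged_at < NOW() - INTERVAL 12 MONTH"
```

The --no-create-info flag keeps table definitions out of the archive, so the dump contains only INSERT statements and can be reloaded later without disturbing the live schema.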
Note that old data still needs to be retained and that it might need to be searched after the 12-month lifetime. Log retention systems should therefore allow old data to be reloaded into the database to facilitate processing.
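Continuing the sketch above, reloading an archived month is just the reverse pipeline (the file name is again illustrative):

```
# Reload an archived month's rows back into the events table
gunzip -c /archive/events-200601.sql.gz | mysql logdb
```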
The final “best practice” for logging is to implement log file analysis. There are several commercial products that perform sophisticated correlation and root-cause analysis on log files. Again, Microsoft MOM can help here. Another choice is Splunk, which is interesting in that it focuses on search rather than on ambitious automated analysis.
Implementing an effective logging system seems like it should be straightforward, but the reality is that it’s not. The standard tools built into operating systems are not sufficient to implement all stages of the logging process; third-party tools are frequently needed to implement log retention and analysis.