Archive for June, 2008

Best Practices: Logging

Wednesday, June 25th, 2008

I have written about system logging once or twice already. Earlier this month, I contrasted Windows and Unix event logs. A couple of posts ago, I mentioned logging as a “best practice” that’s frequently done poorly. In this post, I describe what I believe are currently accepted best practices for log collection, retention and analysis.

Log collection is not trivial. If done properly, the process should be:

  1. Useful - you should avoid logging unnecessary information
  2. Accurate - you should never lose log data due to network failures or traffic
  3. Secure - logs should be protected from malicious modification; log-related network traffic should be encrypted
  4. Automatic - manual operation will always be a weak link in any system

Log collection systems typically involve two or more levels of log collection and storage. First, each endpoint needs to perform useful, accurate logging. On Windows systems, you should set your audit ACLs carefully to avoid unnecessary log events. You should ensure that you have sufficient disk space for logging and that your event logs are configured to be large enough and to never overwrite themselves.

On UNIX and kin, you need to look at your syslog configuration and see what you can do with it. All versions of syslog let you send different log “facilities” and different priority levels to different locations. You should configure syslog to direct the kern and auth facilities to a different location than the noisier, less important facilities (mail, lpr, news, etc.). You’ll have to consider how important user, daemon and others are to you. For some servers, especially those running mission critical applications that employ the user facility, you may want to consider that facility important as well and direct it to the same location as kern and auth.

Next, you need to consider the priority levels. Usually, anything with priority lower than warning does not need to be retained. The objective with syslog is to identify the information that you want to retain and separate it from that which you don’t. I usually end up with three log files: one for important, retained information (kern and auth output at warning or above), one for less important information from kern and auth and one for all other information. The latter two categories aren’t retained centrally; they’re logged locally and managed with a log rotation mechanism. On most systems, log rotation can be performed by some versions of syslog or with a separate logrotate package.
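To make the three-file split concrete, here is a minimal sketch in Python of the routing decision. It is not a syslog configuration. The facility and severity numbers are the standard syslog codes, but the file names and the choice of kern and auth as the only "important" facilities are assumptions made for illustration.

```python
# Standard syslog facility and severity codes; PRI = facility * 8 + severity.
FACILITIES = {0: "kern", 1: "user", 2: "mail", 3: "daemon",
              4: "auth", 5: "syslog", 6: "lpr", 7: "news"}
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]

# Assumption: only kern and auth are "important"; add "user" or "daemon" as needed.
IMPORTANT_FACILITIES = {"kern", "auth"}
WARNING = SEVERITIES.index("warning")  # 4; smaller numbers are more severe


def decode_pri(pri):
    """Split a syslog PRI value into (facility name, severity name)."""
    facility, severity = divmod(pri, 8)
    return FACILITIES.get(facility, f"facility{facility}"), SEVERITIES[severity]


def route(pri):
    """Pick one of three hypothetical log files for a message."""
    facility, severity = decode_pri(pri)
    if facility in IMPORTANT_FACILITIES:
        if SEVERITIES.index(severity) <= WARNING:
            return "retained.log"       # forwarded for central retention
        return "important-low.log"      # kept locally and rotated
    return "other.log"                  # kept locally and rotated
```

For example, an auth message at err (PRI 35) lands in the retained file, while a kern message at info (PRI 6) stays local.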

The endpoint is the first level of log collection. The next levels typically take endpoint data and put it in a centralized database. In large installations, this may involve intermediate servers taking data from sets of endpoints and then writing the data to a single centralized database cluster.

There are various products that support centralized log collection. NetPro’s LogAdmin product offers both Windows and UNIX support. You can also use Microsoft’s System Center Operations Manager (MOM). For pure UNIX installations, some companies choose to make use of syslog’s server functionality. syslog, in addition to managing local logs, can also act as a server, gathering information from multiple machines. This approach, unfortunately, can suffer from several limitations. Standard syslog, for example, only supports unencrypted UDP-based communication. This network protocol is subject to both data loss and to poor security. Other versions of syslog such as syslog-ng support the TCP protocol which, coupled with stunnel, can provide encrypted, reliable communication.

Regardless of what collection scheme you use, you should also implement good retention policies. Logging tends to generate very large amounts of data. Understanding what you do and don’t need to have handy is important. Best practices suggest that you should maintain, in readily accessible form, the last 12 months of data. “Readily accessible” usually translates to “fully searchable, uncompressed data”. Information older than 12 months should still be stored, but it can be compressed to minimize disk space. In practice, what this means is that log retention systems typically store 12 months of data in a relational database but then compress old data and remove it from the database.

Note that old data still needs to be retained and that it might need to be searched after the 12 month lifetime. Log retention systems should allow old data to be reloaded in the database to facilitate processing.
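The retention cycle described above (keep 12 months searchable, compress the rest, reload on demand) can be sketched roughly as follows. This is an illustration only: the record layout (a dict with `time` and `msg`), the JSON-lines archive format, and treating 12 months as 365 days are all assumptions, and a real system would move rows in and out of a database rather than Python lists.

```python
import gzip
import json
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)  # "12 months", approximated as 365 days


def split_by_age(records, now):
    """Return (recent, old) lists; each record is a dict with a 'time' datetime."""
    cutoff = now - RETENTION
    recent = [r for r in records if r["time"] >= cutoff]
    old = [r for r in records if r["time"] < cutoff]
    return recent, old


def archive(records):
    """Compress old records into a gzip blob of JSON lines."""
    lines = "\n".join(
        json.dumps({"time": r["time"].isoformat(), "msg": r["msg"]})
        for r in records
    )
    return gzip.compress(lines.encode("utf-8"))


def reload(blob):
    """Decompress an archive back into records so it can be searched again."""
    text = gzip.decompress(blob).decode("utf-8")
    records = []
    for line in text.splitlines():
        item = json.loads(line)
        records.append({"time": datetime.fromisoformat(item["time"]),
                        "msg": item["msg"]})
    return records
```

The round trip through `archive` and `reload` is the important property: compressed data must come back in a form the search tools can use.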

The final “best practice” for logging is to implement log file analysis. There are several commercial products that perform sophisticated correlation tests and root-cause determination tests on log files. Again, Microsoft MOM can help here. Another choice would be Splunk. Splunk is interesting in that it focuses on searching instead of ambitious analysis.

Implementing an effective logging system seems like it should be straightforward but the reality is that it’s not. The standard tools built into operating systems are not sufficient to implement all stages of the logging process. Third-party tools are frequently needed in order to implement log retention and analysis.

OS Support for Auditing

Tuesday, June 24th, 2008

I spent most of last week in Boston, attending the Red Hat Summit. This is an annual show put on by Red Hat to discuss various topics of interest to their customers, partners and analysts. I was there because Likewise is a Red Hat partner and because we received an Innovator of the Year award.

I sat in on various sessions. One of the most interesting ones was an update on changes to the auditing system in the Linux kernel and in the associated Red Hat tools.

What is auditing? Auditing is the ability to detect access to specific resources and to note who performed the access and when the access was performed. Examples of this are auditing when someone reads from or writes to the /etc/passwd file in UNIX or when someone tries to add a member to the Domain Admins group in Windows Active Directory. In both of these cases, there might be completely legitimate reasons for such access. Auditing is about collecting information, not about enforcing security. Of course, if security is violated, auditing information can provide forensic evidence to discover whodunnit and what was dun.

The auditing system in Red Hat and other versions of Linux is implemented by a combination of kernel modifications and code in the auditd daemon. Auditd reads configuration files and informs the kernel what information it’s interested in. When relevant events are detected in the kernel, it signals back to auditd that, in turn, generates log information and/or passes information to other interested parties.

Auditd provides a rich set of things that can be audited. For example, any system call can be logged as can access to any file.

Configuring auditd is not trivial. Here’s a link to a page that describes how it’s done. Additionally, if you want to read audit output, it can be messy, too.

As with most things UNIX-y, there’s little consistency between versions. Linux offers one thing. Solaris, HPUX and AIX all offer something different (from Linux and from each other!). They all do more or less the same thing.

And then there’s Windows.

Windows takes a fundamentally different approach to auditing. In Windows, the auditing mechanism is almost identical to its security mechanism. Security is implemented by using access control lists (ACLs). So is auditing.

Windows supports two types of ACLs: discretionary ACLs (DACLs) and system ACLs (SACLs). The former is used for security, the latter for auditing. Just as Windows uses a DACL to determine if a user has access to a resource, it uses the SACL to determine whether or not the access should be logged. Log output is sent to event logs.

The ACL approach means that Windows can audit access to any resource that has ACLs associated with it. You can audit access to a file (or to entire directories since ACLs are “inherited”). You can also audit access to registry entries or nodes in Active Directory. Of course, things without ACLs cannot be audited. You can’t audit when an application calls a specific system call, for example.
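A toy model may help show how the two lists divide the work. The sketch below is not the Windows API. The `Resource` class, the per-user dictionaries and the `event_log` list are invented for illustration, and real SACL entries are richer (they apply to groups and can audit successes and failures separately).

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    dacl: dict = field(default_factory=dict)  # user -> set of allowed rights
    sacl: dict = field(default_factory=dict)  # user -> set of audited rights


event_log = []  # stand-in for the Windows event log


def access(resource, user, right):
    """Decide access from the DACL; log the attempt if the SACL asks for it."""
    allowed = right in resource.dacl.get(user, set())
    if right in resource.sacl.get(user, set()):
        outcome = "success" if allowed else "failure"
        event_log.append((user, resource.name, right, outcome))
    return allowed
```

Note that the DACL alone decides whether access is granted. The SACL only decides whether the attempt leaves a trace, which is why a resource without ACLs cannot be audited this way.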

OS auditing is a very powerful feature that can give you assurances that your security systems are working effectively and that your privileged users are not unnecessarily accessing restricted data. Even in Windows, however, where configuring the audit system is easy, I think the features are underutilized. In UNIX/Linux/Mac I suspect they’re used by fewer than 5% of users.

Monitoring: What You Don’t Know Can Hurt You

Monday, June 23rd, 2008

In my last post, I mentioned network and application monitoring as one of those best practices that’s unfortunately not practiced as often as it should be. The importance of monitoring systems cannot be overstated. You want to know that your computers are functioning as you expect them to and that the applications running on them are also functional. Note that these two are only slightly related and correlated. True, if a computer has crashed, the applications running on it have also crashed. On the other hand, just because your hardware and operating system are running doesn’t mean that your applications are. This is the essential difference between network and application monitoring. I’ll come back to this point later.

If monitoring is so important, why doesn’t everybody do it? Well, in a sense they do, but the poorest practice is to rely on human monitoring (i.e. waiting for your customers to tell you your computers are down). Why doesn’t everyone implement automated monitoring systems? To consider the answer to this question, let’s review how these systems work.

There are various ways of classifying monitoring systems. One way to classify them that’s relevant to this discussion is based on whether the system is agent-based or agent-less.

In an agent-based system, special monitoring software is present on every computer and network device that is to be monitored. This monitoring agent evaluates the health of the computer/device and signals to the central monitoring software when something is out of kilter. Monitoring agents can sometimes also be queried by the central monitoring console in order to provide operating metrics, for example, performance data or resource availability data. Because it’s the agent that detects anomalies and informs the monitoring console, these systems can also be considered push type systems; the agent pushes the data to the console.
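As a rough sketch of the push model, the following Python shows an agent that evaluates local health and reports only when something is wrong. The function names, the alert format and the `push` callback are assumptions. A real agent would run on a schedule and send alerts over the network to the console.

```python
import shutil


def check_disk(path, min_free_fraction):
    """Return an alert dict if free space on `path` is below the threshold, else None."""
    usage = shutil.disk_usage(path)
    # Treat a filesystem that reports zero total size as having no free space.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    if free_fraction < min_free_fraction:
        return {"check": "disk", "path": path,
                "free_fraction": round(free_fraction, 3)}
    return None


def run_agent_once(checks, push):
    """Run every check and push any alerts to the console; return how many were pushed."""
    pushed = 0
    for check in checks:
        alert = check()
        if alert is not None:
            push(alert)
            pushed += 1
    return pushed
```

A caller might pass `[lambda: check_disk("/", 0.10)]` as the checks and a function that posts to the monitoring console as `push`.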

Agent-less systems do not require any special monitoring software on the computers and devices that are being monitored. Instead, the monitoring software uses pull mechanisms to evaluate the health of a monitored entity. These mechanisms might consist of low-level network probes (for example, pinging a device) or higher-level probes such as a specific HTTP request or an RPC call.
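A minimal example of an agent-less probe is a plain TCP connection attempt, sketched here in Python. The function name and the default timeout are my own choices. Higher-level probes would go on to send an HTTP request or an RPC call over the connection.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Nothing needs to be installed on the monitored machine, which is the appeal. The limitation is equally visible: the probe learns only what is reachable from the outside.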

Agent-less systems are easier to implement, but agent-based systems are inherently more capable of evaluating system health as they have all operating system services at their disposal rather than just the ones accessible through external network means.

As a personal opinion, I also posit that agent-based systems are superior at hardware and OS monitoring whereas agent-less systems are ideal for application level monitoring. The former is typically more concerned about hardware and system services whereas the latter is concerned solely about whether applications are functional or not. How best to evaluate applications? Simulate their use and evaluate the quality of their responses. Say you are monitoring a banking application. What better way to determine whether the application is running properly or not than by simulating a user, bringing up the bank web site, performing a transaction and checking your balances? Remember to use dummy accounts set up for this purpose.
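The banking example might look something like the sketch below. Everything here is hypothetical: the client interface (`login`, `balance`, `deposit`), the dummy account name and the in-memory `FakeBank` are stand-ins so the idea can be run anywhere. A real check would drive the actual web site or API.

```python
class FakeBank:
    """In-memory stand-in for a real banking client (hypothetical interface)."""

    def __init__(self):
        self.balances = {"dummy-001": 100}

    def login(self):
        pass

    def balance(self, account):
        return self.balances[account]

    def deposit(self, account, amount):
        self.balances[account] += amount


def synthetic_check(client, account, amount):
    """Deposit into a dummy account and verify the balance changed as expected."""
    try:
        client.login()
        before = client.balance(account)
        client.deposit(account, amount)
        after = client.balance(account)
    except Exception as exc:  # any failure means the application is unhealthy
        return False, f"transaction failed: {exc}"
    if after != before + amount:
        return False, f"balance mismatch: expected {before + amount}, got {after}"
    return True, "ok"
```

The check passes only if the whole transaction produces the right answer, which is a much stronger statement than "the server answers pings".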

There are some decent agent-less monitoring systems. Nagios, for example, supports numerous network probes that can be used in clever ways. Writing new probes is relatively easy, too. Nagios, by the way, can support both agent-based and agent-less monitoring. SiteScope, formerly from Mercury, now from HP, is also pretty cool.

As to agent-based monitoring, the pickins are much slimmer. The simplest agent-based systems are, naturally, based on the Simple Network Management Protocol (SNMP). SNMP allows devices to “publish” a set of data that can be queried and displayed by monitoring consoles. Device manufacturers (SNMP is most heavily used by routers and other network gizmos) design a tree-like structure of data called a management information base (a MIB). At each node in the tree is some datum that describes the operational health of the device. The manufacturer gets a magic number assigned to the company and each node in the MIB is identified by the company OID and a dotted sequence that describes the node’s position in the tree. SNMP-aware monitoring software, once informed of the device’s MIB, can query the device (using the SNMP protocol) to retrieve values for the various data nodes. SNMP also allows management software to write to SNMP addresses in order to configure devices. Finally, devices, having detected anomalies, can raise SNMP traps that can be “caught” by monitoring software.
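To illustrate the addressing scheme, here is a toy MIB in Python. It does not speak SNMP. It only models how a vendor subtree hangs under the private enterprises prefix (1.3.6.1.4.1) and how a console looks up or walks nodes. The enterprise number 99999 and all node names and values are invented.

```python
# Private enterprise subtrees live under 1.3.6.1.4.1.<enterprise number>.
ENTERPRISES = (1, 3, 6, 1, 4, 1)
VENDOR = ENTERPRISES + (99999,)  # hypothetical enterprise number

# A toy MIB: each OID maps to (name, current value). All names are invented.
MIB = {
    VENDOR + (1, 1): ("fanSpeedRpm", 4200),
    VENDOR + (1, 2): ("chassisTempCelsius", 41),
    VENDOR + (2, 1): ("portsUp", 23),
}


def parse_oid(text):
    """Turn a dotted string such as '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def mib_get(oid):
    """Return the value at an exact OID, or None if the node does not exist."""
    entry = MIB.get(parse_oid(oid))
    return entry[1] if entry else None


def mib_walk(prefix):
    """Yield (dotted OID, name, value) for every node under prefix, in OID order."""
    root = parse_oid(prefix)
    for oid in sorted(MIB):
        if oid[:len(root)] == root:
            name, value = MIB[oid]
            yield ".".join(map(str, oid)), name, value
```

The console needs the MIB definition to know that `1.3.6.1.4.1.99999.1.2` means chassis temperature; on the wire there are only numbers.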

The main drawback with SNMP is that it has a very poor security model. SNMPv3 (the latest incarnation) tries to address the security issue, but few devices support the new version. Without good security, SNMPv2 allows non-authorized users to view the operational status of a monitored device and to, perhaps, gain information that can be used to compromise it. Note too that devices that support configuration via SNMPv2 are vulnerable to being maliciously configured by non-authorized users.

While SNMP is frequently implemented in network hardware, it is also occasionally implemented in UNIX and UNIX-like computers and very occasionally on Windows machines.

Naturally, Windows computers are typically monitored using a different technique. Three of them, in fact. Sigh.

First, Windows computers support RPC. An administrator can tell if a Windows computer is healthy by connecting to it with a remote management console and looking at various data. The perfmon program, for example, can display graphs of Windows performance counters that measure available disk space, RAM and hundreds of other data.

Second, Windows computers support the Windows Management Instrumentation (WMI) protocol. WMI is a crude object oriented mechanism that allows Windows monitoring and management software to query system metrics, set system parameters and invoke management functions. WMI, by the way, is based on a DMTF standard known to the rest of the world as CIM or WBEM. Forget about the “standards” part - Microsoft WMI is not interoperable with anyone else’s implementation. The Microsoft System Center Operations Manager (MOM) folk had to implement their own WBEM code for Linux/UNIX in order to monitor these systems. The mechanism they implemented is actually the third monitoring technique that’s available on Windows, WS-Management or WS-Man as it’s frequently referred to.

WS-Man, like all of the WS-* protocols, is based on SOAP. A WS-Man aware monitoring program can read performance metrics and write configuration values by performing XML-based SOAP calls to a monitored device.
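To give a feel for what such a call looks like, the sketch below builds a bare-bones WS-Transfer Get envelope using only the Python standard library. It is an assumption-laden outline rather than a working client: the endpoint and resource URI are supplied by the caller, and a real request needs further headers (reply address, timeouts, selectors) plus authentication.

```python
import uuid
import xml.etree.ElementTree as ET

NS = {
    "s": "http://www.w3.org/2003/05/soap-envelope",           # SOAP 1.2
    "a": "http://schemas.xmlsoap.org/ws/2004/08/addressing",  # WS-Addressing
    "w": "http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd",    # WS-Management
}
GET_ACTION = "http://schemas.xmlsoap.org/ws/2004/09/transfer/Get"


def build_get_request(endpoint, resource_uri):
    """Build a minimal WS-Transfer Get envelope as an XML string."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)
    envelope = ET.Element(f"{{{NS['s']}}}Envelope")
    header = ET.SubElement(envelope, f"{{{NS['s']}}}Header")
    ET.SubElement(header, f"{{{NS['a']}}}To").text = endpoint
    ET.SubElement(header, f"{{{NS['w']}}}ResourceURI").text = resource_uri
    ET.SubElement(header, f"{{{NS['a']}}}Action").text = GET_ACTION
    ET.SubElement(header, f"{{{NS['a']}}}MessageID").text = f"uuid:{uuid.uuid4()}"
    ET.SubElement(envelope, f"{{{NS['s']}}}Body")  # a Get carries an empty body
    return ET.tostring(envelope, encoding="unicode")
```

Even this stripped-down request needs three XML namespaces and four headers to ask for a single object.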

Although WS-Man seems like A Good Thing, especially since Microsoft is providing it on non-Windows platforms, I think it has several key flaws. First, WS-Man is based on both SOAP and WMI/CIM/WBEM. SOAP requires a considerable bit of glue in order to implement. In Windows, C# and .NET make it pretty easy. On Unix, you can do it in C++ using Axis for example, you can do it in Java using Sun JWSDP or you can do it in Perl/Python or other SOAP aware scripting language. Each of these has its flaws. The C++ approach is error prone. The .NET or Java approaches require a huge runtime memory footprint. The Perl/Python approach is typeless, requiring manual development of SOAP WSDL files instead of reflection-based synthesis. Beyond the SOAP issues, WMI/CIM/WBEM is simply butt-ugly (maybe even fugly). The technology had the misfortune of being designed at a time before Java and C# came to fruition. As a result, its extension mechanism is just clunky.

Beyond SNMP, RPC, WMI and WS-Man, there are yet other solutions. Companies that make monitoring software (for example, Microsoft, IBM, HP, BMC, Computer Associates, and others) frequently have their own proprietary monitoring agents that use yet other protocols.

Given all of these unattractive alternatives, it is not surprising that companies don’t diligently monitor all of their systems. The ones who do this best usually end up using a mashup of various mechanisms: SNMP for network hardware, System Center/MOM for their Windows systems, some Nagios for agent-less monitoring, toss in some HP OpenView in one or two divisions and some home grown stuff elsewhere.

What would Alan Turing do? Ack. I suppose WS-Man is better than the alternatives but I just can’t imagine Cisco adding all the necessary software to implement it.

Best Practices vs. Practical Reality

Sunday, June 22nd, 2008

I’m struck by the huge gap that I see between acknowledged best practices and what companies are actually practicing. More than once, I have researched a product direction and decided not to pursue it because it seemed to me that the market must already be saturated. Later, when talking to customers, I’m stunned to find that hardly any of them have bought or are using the product in question. How can a market support a dozen companies when none of them seem to have any market share?

If you are a regular reader of eWeek, Infoworld, CIO Insight and others you might be led to believe that every IT department is:

  • Heavily using virtualization products
  • Using comprehensive network and application monitoring tools
  • Diligently practicing strong security techniques
  • Maintaining audit logs and performing correlation analysis on them
  • Faithfully practicing ITIL techniques

Now, I’m not in Sales, but I’ve been to a lot of sales calls. I’ve probably talked to 100 companies over the last year. Of these, I can count on zero fingers the number of them that are practicing all of the things mentioned above. On the other hand, the number of companies that are practicing none of the above is definitely non-zero!

In most cases, when I ask companies about the items I listed above, they sheepishly admit to their failings. They know they should be doing these things. In some cases, they’ve even already paid for necessary software but have yet to deploy it. There is some tremendously successful shelfware in the industry.

What to make of all this?

  1. Don’t shy away from markets that seem to be crowded. There is still plenty of “whitespace” in the market where clever products and good companies can succeed.
  2. Don’t assume you can’t compete against software that’s been available for many years. I believe that there’s a lot of enterprise software that suffers from having been written 5 or 10 years ago using brittle programming techniques. A company with a strong engineering team can quickly develop a competing product using modern tools and techniques.
  3. Even good ideas and good products can take a good long time before they’re commonplace in IT. Certainly, most IT departments have good backup/restore infrastructure and good disaster recovery plans. It’s probably taken 10-20 years to make these pervasive practices.
  4. Read the journals, but talk to customers. The rags are way too preoccupied with what the top 5% of the IT innovators are doing. They are much less representative of the other 1900 companies in the Fortune 2000.

There’s one other aspect of the problem that I’m still digesting. If there are a dozen competitors in a market, all doing the same thing, and none of them are succeeding, maybe the solution is to do something else. I’m reminded of a story I heard while taking a quality control course years ago. The story takes place in WWII and describes how an aircraft manufacturer decided on what parts of its planes needed extra armor. It had an engineer studying aircraft returning from combat. The company would look at bullet holes on the planes and add armor where there were no bullet holes. Why? Obviously, the planes that got shot in those places were the ones that didn’t survive the battle.

This is obvious once it’s explained but it requires appreciating what artists call “negative space”. If you don’t know what this is, look for the arrow in the FedEx logo.

Virtual Directories

Saturday, June 21st, 2008

I’ve talked to a couple of companies now that sell virtual directory products.  Most recently, I talked to Identyx, a company recently bought by Red Hat to enhance their directory product offerings. A virtual directory is software that looks like a directory (typically, an LDAP one) but doesn’t actually store any data. Whenever a request comes in, the virtual directory retrieves the requested data from one of several configured data sources. A virtual directory can “front” an existing LDAP directory but it can also make data in relational databases, flat files or other sources appear to exist in an LDAP directory.

I think this is a pretty cool concept. Implementing a single, comprehensive directory is at best difficult and, at worst, impossible. Companies frequently have data in multiple repositories. A virtual directory allows this data to appear to be in a comprehensive directory while actually remaining in its native stores. A virtual directory can also simplify synchronization of data across repositories. Adding an object to a virtual directory can implicitly require the addition of data to the constituent repositories. Modifications to a datum might actually result in modifications to multiple repositories that contain the duplicated datum.
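The core idea fits in a few lines of Python. This is a conceptual sketch, not how any particular product works. The class, the backend lookup functions and the "earlier backends win" merge rule are all assumptions, and a real virtual directory would speak LDAP and translate search filters for each source.

```python
class VirtualDirectory:
    """Answers lookups by delegating to backends; stores no entries itself."""

    def __init__(self):
        self.backends = []  # list of (name, lookup function)

    def add_backend(self, name, lookup):
        self.backends.append((name, lookup))

    def get(self, uid):
        """Merge the attributes every backend knows about `uid`; earlier backends win."""
        merged = {}
        for _name, lookup in self.backends:
            attrs = lookup(uid) or {}
            for key, value in attrs.items():
                merged.setdefault(key, value)
        return merged or None
```

With an LDAP store and an HR database registered as two backends, a single `get("jdoe")` returns one entry whose attributes come from both, while the data itself never moves.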

The biggest challenge with virtual directories is trying to retrofit them to applications that are currently directly wired to the constituent data sources. In the case of an application that reads database information by using JDBC/ODBC, it might be impossible to change it to using LDAP for its data access. Note that some virtual directories, however, can provide multiple access interfaces, for example, both LDAP and SQL. Even in the case of applications that currently use LDAP, it can be a challenge for a virtual directory to completely mimic a constituent LDAP repository. If the application analyzes the directory schema for example (something that’s possible with Microsoft Active Directory), the virtual directory either has to synthesize a comprehensive schema including data from other sources or it has to “lie” and only deliver the schema elements in the constituent repository. The former approach can confuse applications that expect a specific schema (for example, “Microsoft schema revision 31”) while the latter approach can confuse applications that use schema information to drive their operation.

LDAP security can also be difficult to emulate/synthesize. If an Active Directory-aware program controls security by manipulating AD access control lists (ACLs), the virtual directory might need to synthesize objectSecurity attributes for objects that lie in repositories that don’t normally support ACLs and then reflect any changes back into the constituent stores. This might be difficult. Placing record-level ACLs on database rows, for example, might not be something that is supported by a constituent data store. In this case, the virtual directory might need to store its own parallel information.

Virtual directories can also be slow. The whole point of LDAP is to be fast (unlike the original X.500 directories that no one actually uses). If, however, data is actually coming from a slower store, fulfilling an LDAP request will be slow, too. For this reason, virtual directories need to perform intelligent caching.
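One simple form of caching is to remember each answer for a fixed time, sketched below. The class name, the time-to-live policy and the injectable clock are my own choices. Real products also have to worry about invalidation when the underlying store changes.

```python
import time


class CachingLookup:
    """Wrap a slow lookup function and remember its answers for `ttl` seconds."""

    def __init__(self, lookup, ttl, clock=time.monotonic):
        self.lookup = lookup
        self.ttl = ttl
        self.clock = clock  # injectable so the expiry logic can be tested
        self.cache = {}     # key -> (expiry time, value)

    def __call__(self, key):
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh: skip the slow store
        value = self.lookup(key)     # missing or expired: ask the slow store
        self.cache[key] = (now + self.ttl, value)
        return value
```

The trade-off is staleness: for up to `ttl` seconds the directory may serve an answer that the backing store has already changed.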

Once virtual directories start storing their caches, they become a hybrid of sorts. They’re virtual directories but they can also behave like meta directories, too. I’ll write about those in some other post.