Monitoring: What You Don't Know Can Hurt You

In my last post, I mentioned network and application monitoring as one of those best practices that’s unfortunately not practiced as often as it should be. The importance of monitoring systems cannot be overstated. You want to know that your computers are functioning as you expect them to and that the applications running on them are also functional. Note that these two are only loosely related. True, if a computer has crashed, the applications running on it have also crashed. On the other hand, just because your hardware and operating system are running doesn’t mean that your applications are. This is the essential difference between network and application monitoring. I’ll come back to this point later.

If monitoring is so important, why doesn’t everybody do it? Well, in a sense they do, but the poorest practice is to rely on human monitoring (i.e. waiting for your customers to tell you your computers are down). Why doesn’t everyone implement automated monitoring systems? To consider the answer to this question, let’s review how these systems work.

There are various ways of classifying monitoring systems. One way to classify them that’s relevant to this discussion is based on whether the system is agent-based or agent-less.

In an agent-based system, special monitoring software is present on every computer and network device that is to be monitored. This monitoring agent evaluates the health of the computer/device and signals to the central monitoring software when something is out of kilter. Monitoring agents can sometimes also be queried by the central monitoring console in order to provide operating metrics, for example, performance data or resource availability data. Because it’s the agent that detects anomalies and informs the monitoring console, these systems can also be considered push type systems; the agent pushes the data to the console.
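To make the push model concrete, here’s a minimal sketch in Python of what an agent’s main loop might look like. The console address, the JSON payload format and the use of the psutil library are all illustrative assumptions, not any particular product’s protocol.

```python
import json
import socket
import time

import psutil  # third-party; used here only to read local metrics

CONSOLE_ADDR = ("monitor.example.com", 9000)  # hypothetical central console
INTERVAL = 60                                 # seconds between reports


def collect_health():
    """Gather a few local metrics that no remote probe could easily see."""
    disk = psutil.disk_usage("/")
    return {
        "host": socket.gethostname(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "disk_free_pct": disk.free * 100.0 / disk.total,
        "timestamp": time.time(),
    }


def push_forever():
    """Push each health report to the console; a real agent would also queue and retry."""
    while True:
        report = json.dumps(collect_health()).encode("utf-8")
        with socket.create_connection(CONSOLE_ADDR, timeout=5) as sock:
            sock.sendall(report)
        time.sleep(INTERVAL)


if __name__ == "__main__":
    push_forever()
```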

Agent-less systems do not require any special monitoring software on the computers and devices that are being monitored. Instead, the monitoring software uses pull mechanisms to evaluate the health of a monitored entity. These mechanisms might consist of low-level network probes (pinging a device, for example) or higher-level probes such as a specific HTTP request or an RPC call.
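Here’s a rough sketch of the pull model in Python; the target addresses and the particular probes are invented for illustration.

```python
import subprocess

import requests  # third-party HTTP client


def ping_probe(host):
    """Low-level probe: a single ICMP ping via the system ping command (Linux flags)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0


def http_probe(url):
    """Higher-level probe: does the application answer an HTTP request sanely?"""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    print("router alive:", ping_probe("192.0.2.1"))                         # hypothetical device
    print("web app alive:", http_probe("https://app.example.com/health"))   # hypothetical URL
```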

Agent-less systems are easier to implement, but agent-based systems are inherently more capable of evaluating system health, since they have all operating system services at their disposal rather than just the ones reachable over the network.

As a personal opinion, I also posit that agent-based systems are superior at hardware and OS monitoring whereas agent-less systems are ideal for application-level monitoring. The former is typically concerned with hardware and system services whereas the latter is concerned solely with whether applications are functional or not. How best to evaluate applications? Simulate their use and evaluate the quality of their responses. Say you are monitoring a banking application. What better way to determine whether the application is running properly than by simulating a user: bringing up the bank web site, performing a transaction and checking your balances? Remember to use dummy accounts set up for this purpose.
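To illustrate, here’s a sketch of such a synthetic transaction in Python using the requests library. The URLs, form fields, page text and dummy credentials are all invented; a real banking site would obviously differ.

```python
import requests

BASE = "https://bank.example.com"    # hypothetical site
DUMMY_USER = "monitor-test"          # dummy account set up just for monitoring
DUMMY_PASS = "not-a-real-password"


def check_banking_app():
    """Simulate a user session and judge the quality of the responses."""
    session = requests.Session()

    # Step 1: can we even reach the login page?
    login_page = session.get(f"{BASE}/login", timeout=10)
    if login_page.status_code != 200:
        return False

    # Step 2: log in with the dummy account.
    login = session.post(f"{BASE}/login",
                         data={"user": DUMMY_USER, "password": DUMMY_PASS},
                         timeout=10)
    if login.status_code != 200:
        return False

    # Step 3: fetch the balance page and sanity-check its content, not just
    # the status code -- a 200 that returns an error page is still a failure.
    balances = session.get(f"{BASE}/accounts/balance", timeout=10)
    return balances.status_code == 200 and "Available balance" in balances.text


if __name__ == "__main__":
    print("banking app healthy:", check_banking_app())
```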

There are some decent agent-less monitoring systems. Nagios, for example, supports numerous network probes that can be used in clever ways. Writing new probes is relatively easy, too. Nagios, by the way, can support both agent-based and agent-less monitoring. SiteScope, formerly from Mercury, now from HP, is also pretty cool.
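To give a flavor of how easy probe writing can be, here’s a sketch of a custom probe that follows the standard Nagios plugin convention: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The path and thresholds are illustrative.

```python
#!/usr/bin/env python3
"""Minimal custom Nagios-style probe: check free disk space on a path."""
import shutil
import sys

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

PATH = "/var/lib/app"         # illustrative path to check
WARN_PCT, CRIT_PCT = 20, 10   # illustrative thresholds (% free)


def main():
    try:
        usage = shutil.disk_usage(PATH)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {PATH}: {exc}")
        return UNKNOWN

    free_pct = usage.free * 100.0 / usage.total
    if free_pct < CRIT_PCT:
        print(f"DISK CRITICAL - {free_pct:.1f}% free on {PATH}")
        return CRITICAL
    if free_pct < WARN_PCT:
        print(f"DISK WARNING - {free_pct:.1f}% free on {PATH}")
        return WARNING
    print(f"DISK OK - {free_pct:.1f}% free on {PATH}")
    return OK


if __name__ == "__main__":
    sys.exit(main())
```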

As to agent-based monitoring, the pickings are much slimmer. The simplest agent-based systems are, naturally, based on the Simple Network Management Protocol (SNMP). SNMP allows devices to “publish” a set of data that can be queried and displayed by monitoring consoles. Device manufacturers (SNMP is most heavily used by routers and other network gizmos) design a tree-like structure of data called a management information base (a MIB). At each node in the tree is some datum that describes the operational health of the device. The manufacturer gets a magic number assigned to the company, and each node in the MIB is identified by the company OID plus a dotted sequence that describes the node’s position in the tree. SNMP-aware monitoring software, once informed of the device’s MIB, can query the device using SNMP to retrieve values for the various data nodes. SNMP also allows management software to write to SNMP addresses in order to configure devices. Finally, devices, having detected anomalies, can raise SNMP traps that can be “caught” by monitoring software.
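Here’s a sketch of what such a query looks like from the monitoring side, using the third-party pysnmp library. The target address and community string are made up; the OID is the standard MIB-II sysUpTime node.

```python
# A minimal SNMP GET using the third-party pysnmp library.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),       # SNMPv2c with the default community string
        UdpTransportTarget(("192.0.2.1", 161)),   # hypothetical router
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),  # sysUpTime.0
    )
)

if error_indication:
    print("probe failed:", error_indication)
else:
    for oid, value in var_binds:
        print(f"{oid} = {value}")
```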

The main drawback with SNMP is that it has a very poor security model. SNMPv3 (the latest incarnation) tries to address the security issue, but few devices support the new version. Without good security, SNMPv2 allows unauthorized users to view the operational status of a monitored device and, perhaps, to gain information that can be used to compromise it. Note too that devices that support configuration via SNMPv2 are vulnerable to being maliciously reconfigured by unauthorized users.

While SNMP is frequently implemented in network hardware, it is also occasionally implemented in UNIX and UNIX-like computers and very occasionally on Windows machines.

Naturally, Windows computers are typically monitored using a different technique. Three of them, in fact. Sigh.

First, Windows computers support RPC. An administrator can tell if a Windows computer is healthy by connecting to it with a remote management console and looking at various data. The perfmon program, for example, can display graphs of Windows performance counters that measure available disk space, RAM and hundreds of other metrics.
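For illustration, here’s a sketch of pulling a couple of those counters from a remote machine by shelling out to the built-in typeperf tool. The counter paths are real perfmon counters, but the host name is invented, and a real setup would handle authentication and output parsing properly.

```python
# Read two Windows performance counters from a remote host via typeperf.
import subprocess

HOST = r"\\FILESERVER01"   # hypothetical remote machine
COUNTERS = [
    HOST + r"\Memory\Available MBytes",
    HOST + r"\LogicalDisk(C:)\% Free Space",
]

result = subprocess.run(
    ["typeperf"] + COUNTERS + ["-sc", "1"],   # -sc 1: collect a single sample
    capture_output=True, text=True,
)
print(result.stdout)   # CSV output: a timestamp followed by one column per counter
```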

Second, Windows computers support Windows Management Instrumentation (WMI). WMI is a crude object-oriented mechanism that allows Windows monitoring and management software to query system metrics, set system parameters and invoke management functions. WMI, by the way, is based on a DMTF standard known to the rest of the world as CIM or WBEM. Forget about the “standards” part: Microsoft’s WMI is not interoperable with anyone else’s implementation. The Microsoft System Center Operations Manager (formerly MOM) folks had to implement their own WBEM code for Linux/UNIX in order to monitor those systems. The mechanism they implemented is actually the third monitoring technique that’s available on Windows: WS-Management, or WS-Man as it’s frequently referred to.
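To show what a WMI query looks like, here’s a sketch in Python using the third-party wmi package (a wrapper around the pywin32 COM bindings); the server name is an illustrative assumption.

```python
# Query fixed-disk free space on a remote Windows host over WMI.
import wmi

conn = wmi.WMI("SERVER01")   # hypothetical remote Windows host

# DriveType=3 restricts the result to local fixed disks.
for disk in conn.Win32_LogicalDisk(DriveType=3):
    free_pct = int(disk.FreeSpace) * 100.0 / int(disk.Size)
    print(f"{disk.Caption} {free_pct:.1f}% free")
```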

WS-Man, like all of the WS-* protocols, is based on SOAP. A WS-Man-aware monitoring program can read performance metrics and write configuration values by making XML-based SOAP calls to a monitored device.
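Here’s a heavily abbreviated sketch of what such a call looks like on the wire. The resource URI, host and credentials are invented, and several mandatory WS-Addressing headers (MessageID, ReplyTo and so on) are left out so the overall shape stays visible.

```python
# POST a (simplified) WS-Man Transfer/Get SOAP envelope to a monitored host.
import requests

WSMAN_URL = "http://server01.example.com:5985/wsman"   # default WinRM HTTP port

ENVELOPE = """<?xml version="1.0" encoding="UTF-8"?>
<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope"
            xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
            xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd">
  <s:Header>
    <wsa:To>{url}</wsa:To>
    <wsa:Action>http://schemas.xmlsoap.org/ws/2004/09/transfer/Get</wsa:Action>
    <wsman:ResourceURI>
      http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/Win32_OperatingSystem
    </wsman:ResourceURI>
  </s:Header>
  <s:Body/>
</s:Envelope>""".format(url=WSMAN_URL)

response = requests.post(
    WSMAN_URL,
    data=ENVELOPE,
    headers={"Content-Type": "application/soap+xml;charset=UTF-8"},
    auth=("monitor-user", "not-a-real-password"),   # illustrative credentials
    timeout=10,
)
print(response.status_code)
print(response.text[:500])   # the reply, if accepted, is itself a SOAP envelope
```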

Although WS-Man seems like A Good Thing, especially since Microsoft is providing it on non-Windows platforms, I think it has several key flaws. First, WS-Man is based on both SOAP and WMI/CIM/WBEM. SOAP requires a considerable amount of glue to implement. On Windows, C# and .NET make it pretty easy. On Unix, you can do it in C++ (using Axis, for example), in Java (using Sun’s JWSDP) or in Perl, Python or another SOAP-aware scripting language. Each of these has its flaws. The C++ approach is error-prone. The .NET and Java approaches require a huge runtime memory footprint. The Perl/Python approach is typeless, requiring manual development of WSDL files instead of reflection-based synthesis. Beyond the SOAP issues, WMI/CIM/WBEM is simply butt-ugly (maybe even fugly). The technology had the misfortune of being designed before Java and C# came to fruition. As a result, its extension mechanism is just clunky.

Beyond SNMP, RPC, WMI and WS-Man, there are yet other solutions. Companies that make monitoring software (for example, Microsoft, IBM, HP, BMC, Computer Associates, and others) frequently have their own proprietary monitoring agents that use yet other protocols.

Given all of these unattractive alternatives, it is not surprising that companies don’t diligently monitor all of their systems. The ones that do it best usually end up with a mashup of various mechanisms: SNMP for network hardware, System Center/MOM for their Windows systems, some Nagios for agent-less monitoring, a bit of HP OpenView in one or two divisions and some home-grown stuff elsewhere.

What would Alan Turing do? Ack. I suppose WS-Man is better than the alternatives but I just can’t imagine Cisco adding all the necessary software to implement it.