Archive for June, 2008

OS Support for Auditing

Tuesday, June 24th, 2008

I spent most of last week in Boston, attending the Red Hat Summit. This is an annual show put on by Red Hat to discuss various topics of interest to their customers, partners and analysts. I was there because Likewise is a Red Hat partner and because we received an Innovator of the Year award.

I sat in on various sessions. One of the most interesting ones was an update on changes to the auditing system in the Linux kernel and in the associated Red Hat tools.

What is auditing? Auditing is the ability to detect access to specific resources and to note who performed the access and when the access was performed. Examples of this are auditing when someone reads from or writes to the /etc/passwd file in UNIX or when someone tries to add a member to the Domain Admins group in Windows Active Directory. In both of these cases, there might be completely legitimate reasons for such access. Auditing is about collecting information, not about enforcing security. Of course, if security is violated, auditing information can provide forensic evidence to discover whodunnit and what was dun.

The auditing system in Red Hat and other versions of Linux is implemented by a combination of kernel modifications and code in the auditd daemon. Auditd reads configuration files and informs the kernel what information it’s interested in. When relevant events are detected, the kernel signals back to auditd, which, in turn, generates log information and/or passes it to other interested parties.

Auditd provides a rich set of things that can be audited. For example, any system call can be logged as can access to any file.

Configuring auditd is not trivial. Here’s a link to a page that describes how it’s done. The audit output itself can be messy to read, too.
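
For a flavor of what the configuration looks like, here’s a minimal sketch of a couple of rules as they might appear in /etc/audit/audit.rules (the key names after -k are arbitrary labels I made up for searching the logs later):

    # Watch /etc/passwd: log reads, writes and attribute changes
    -w /etc/passwd -p rwa -k passwd-access

    # Log every call to the open() syscall made by uid 500 (64-bit syscall table)
    -a always,exit -F arch=b64 -S open -F uid=500 -k open-by-uid500

Once rules like these are loaded (auditctl -l shows what’s active), you can dig the matching events out of the logs with something like ausearch -k passwd-access.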

As with most things UNIX-y, there’s little consistency across variants. Linux offers one thing. Solaris, HPUX and AIX all offer something different (from Linux and from each other!). They all do more or less the same thing.

And then there’s Windows.

Windows takes a fundamentally different approach to auditing. In Windows, the auditing mechanism is almost identical to its security mechanism. Security is implemented by using access control lists (ACLs). So is auditing.

Windows supports two types of ACLs: discretionary ACLs (DACLs) and system ACLs (SACLs). The former is used for security, the latter for auditing. Just as Windows uses a DACL to determine if a user has access to a resource, it uses the SACL to determine whether or not the access should be logged. Log output is sent to event logs.

The ACL approach means that Windows can audit access to any resource that has ACLs associated with it. You can audit access to a file (or to entire directories since ACLs are “inherited”). You can also audit access to registry entries or nodes in Active Directory. Of course, things without ACLs cannot be audited. You can’t audit when an application calls a specific system call, for example.
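
To make the idea concrete, here’s a toy sketch in Python of how a SACL-style decision might work. To be clear, this is not the Windows API, just an illustration that the same ACL machinery used for access checks also drives the “should we log this?” decision:

    # Toy model of Windows-style auditing (hypothetical names, not the Win32 API).
    # A SACL is just another ACL whose entries say "log this kind of access"
    # rather than "allow or deny it".

    READ, WRITE = 0x1, 0x2

    class AuditAce:
        def __init__(self, trustee, access_mask, on_success, on_failure):
            self.trustee = trustee          # user or group the entry applies to
            self.access_mask = access_mask  # which access types to audit
            self.on_success = on_success    # audit successful accesses?
            self.on_failure = on_failure    # audit failed (denied) accesses?

    def should_audit(sacl, user, groups, requested, access_granted):
        """Return True if this access attempt should generate an audit record."""
        for ace in sacl:
            applies_to_user = ace.trustee == user or ace.trustee in groups
            if applies_to_user and (ace.access_mask & requested):
                if access_granted and ace.on_success:
                    return True
                if not access_granted and ace.on_failure:
                    return True
        return False

    # Audit failed write attempts by anyone in "Everyone"
    sacl = [AuditAce("Everyone", WRITE, on_success=False, on_failure=True)]
    print(should_audit(sacl, "alice", {"Everyone"}, WRITE, access_granted=False))  # True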

OS auditing is a very powerful feature that can give you assurances that your security systems are working effectively and that your privileged users are not unnecessarily accessing restricted data. Even in Windows, however, where configuring the audit system is easy, I think the features are underutilized. In UNIX/Linux/Mac I suspect they’re used by fewer than 5% of users.

Monitoring: What You Don't Know Can Hurt You

Monday, June 23rd, 2008

In my last post, I mentioned network and application monitoring as one of those best practices that’s unfortunately not practiced as often as it should be. The importance of monitoring systems cannot be overstated. You want to know that your computers are functioning as you expect them to and that the applications running on them are also functional. Note that these two are only loosely correlated. True, if a computer has crashed, the applications running on it have also crashed. On the other hand, just because your hardware and operating system are running doesn’t mean that your applications are. This is the essential difference between network and application monitoring. I’ll come back to this point later.

If monitoring is so important, why doesn’t everybody do it? Well, in a sense they do, but the poorest practice is to rely on human monitoring (i.e. waiting for your customers to tell you your computers are down). Why doesn’t everyone implement automated monitoring systems? To consider the answer to this question, let’s review how these systems work.

There are various ways of classifying monitoring systems. One way to classify them that’s relevant to this discussion is based on whether the system is agent-based or agent-less.

In an agent-based system, special monitoring software is present on every computer and network device that is to be monitored. This monitoring agent evaluates the health of the computer/device and signals to the central monitoring software when something is out of kilter. Monitoring agents can sometimes also be queried by the central monitoring console in order to provide operating metrics, for example, performance data or resource availability data. Because it’s the agent that detects anomalies and informs the monitoring console, these systems can also be considered push type systems; the agent pushes the data to the console.
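
As a rough illustration of the push model, here’s a sketch of a trivial agent in Python. The console URL and report format are invented; a real agent would also be smarter about thresholds, retries and security:

    # Minimal push-style monitoring agent (UNIX-ish; hypothetical console endpoint).
    import json
    import os
    import socket
    import time
    import urllib.request

    CONSOLE_URL = "http://monitor.example.com/report"   # made-up central console
    INTERVAL = 60                                        # seconds between samples

    def sample_health():
        load1, _, _ = os.getloadavg()
        disk = os.statvfs("/")
        return {
            "host": socket.gethostname(),
            "load1": load1,
            "disk_free_pct": 100.0 * disk.f_bavail / disk.f_blocks,
            "timestamp": time.time(),
        }

    def push(report):
        req = urllib.request.Request(
            CONSOLE_URL,
            data=json.dumps(report).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)

    while True:
        push(sample_health())
        time.sleep(INTERVAL)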

Agent-less systems do not require any special monitoring software on the computers and devices that are being monitored. Instead, the monitoring software uses pull mechanisms to evaluate the health of a monitored entity. These mechanisms might consist of low-level network probes, for example, pinging a device, or higher-level probes such as a specific HTTP request or an RPC call.
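
Here’s the pull model in the same spirit, again just a sketch: the monitoring station reaches out over the network, first with a low-level reachability check and then with a higher-level HTTP probe:

    # Agent-less "pull" probes: nothing runs on the monitored host.
    import socket
    import urllib.request

    def tcp_probe(host, port, timeout=5):
        """Low-level probe: can we even open a TCP connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def http_probe(url, timeout=10):
        """Higher-level probe: does the service answer an actual request?"""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    # The web server's port can be open even when the application behind it is broken.
    print(tcp_probe("www.example.com", 80))
    print(http_probe("http://www.example.com/"))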

Agent-less systems are easier to implement, but agent-based systems are inherently more capable of evaluating system health as they have all operating system services at their disposal rather than just the ones accessible through external network means.

As a personal opinion, I also posit that agent-based systems are superior at hardware and OS monitoring whereas agent-less systems are ideal for application-level monitoring. The former is typically more concerned with hardware and system services whereas the latter is concerned solely with whether applications are functional or not. How best to evaluate applications? Simulate their use and evaluate the quality of their responses. Say you are monitoring a banking application. What better way to determine whether the application is running properly than by simulating a user: bringing up the bank web site, performing a transaction and checking your balances? Remember to use dummy accounts set up for this purpose.
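
In code, the “simulate a user” idea might look something like the sketch below. Every URL, form field and account here is invented for illustration; the point is that the probe exercises the whole application stack end to end, not just the web server port:

    # Synthetic transaction against a hypothetical banking application,
    # using a dummy account created purely for monitoring.
    import http.cookiejar
    import urllib.parse
    import urllib.request

    BASE = "https://bank.example.com"                    # made-up site
    DUMMY_USER, DUMMY_PASS = "monitor", "not-a-real-password"

    def check_banking_app():
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

        # Step 1: log in as the dummy user.
        creds = urllib.parse.urlencode({"user": DUMMY_USER, "password": DUMMY_PASS}).encode()
        opener.open(BASE + "/login", data=creds, timeout=15)

        # Step 2: fetch the balance page and sanity-check the response.
        page = opener.open(BASE + "/accounts/balance", timeout=15).read().decode()
        return "Available balance" in page

    print("OK" if check_banking_app() else "ALERT: banking app is unhealthy")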

There are some decent agent-less monitoring systems. Nagios, for example, supports numerous network probes that can be used in clever ways. Writing new probes is relatively easy, too. Nagios, by the way, can support both agent-based and agent-less monitoring. SiteScope, formerly from Mercury, now from HP, is also pretty cool.

As to agent-based monitoring, the pickins are much slimmer. The simplest agent-based systems are, naturally, based on the Simple Network Management Protocol (SNMP). SNMP allows devices to “publish” a set of data that can be queried and displayed by monitoring consoles. Device manufacturers (SNMP is most heavily used by routers and other network gizmos) design a tree-like structure of data called a management information base (a MIB). At each node in the tree is some datum that describes the operational health of the device. Each manufacturer gets a magic number assigned to the company, and each node in the MIB is identified by the company’s OID plus a dotted sequence that describes the node’s position in the tree. SNMP-aware monitoring software, once informed of the device’s MIB, can query the device (using the SNMP protocol) to retrieve values for the various data nodes. SNMP also allows management software to write to SNMP addresses in order to configure devices. Finally, devices, having detected anomalies, can raise SNMP traps that can be “caught” by monitoring software.
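
For example, here’s roughly what an SNMP GET looks like from Python using the third-party pysnmp library (assuming SNMPv2c and the default “public” community). It asks a device for sysDescr, OID 1.3.6.1.2.1.1.1.0, a standard MIB-2 node that nearly every SNMP agent answers:

    # SNMP GET sketch using the third-party pysnmp library (4.x hlapi).
    from pysnmp.hlapi import (
        SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
        ObjectType, ObjectIdentity, getCmd,
    )

    def get_sysdescr(host, community="public"):
        error_indication, error_status, _, var_binds = next(
            getCmd(
                SnmpEngine(),
                CommunityData(community, mpModel=1),              # SNMPv2c
                UdpTransportTarget((host, 161)),
                ContextData(),
                ObjectType(ObjectIdentity("1.3.6.1.2.1.1.1.0")),  # sysDescr.0
            )
        )
        if error_indication or error_status:
            raise RuntimeError(str(error_indication or error_status))
        return str(var_binds[0][1])

    print(get_sysdescr("192.0.2.1"))  # 192.0.2.1 is a documentation-only address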

The main drawback with SNMP is that it has a very poor security model. SNMPv3 (the latest incarnation) tries to address the security issue, but few devices support the new version. Without good security, SNMPv2 allows non-authorized users to view the operational status of a monitored device and to, perhaps, gain information that can be used to compromise it. Note too that devices that support configuration via SNMPv2 are vulnerable to being maliciously configured by non-authorized users.

While SNMP is frequently implemented in network hardware, it is also occasionally implemented in UNIX and UNIX-like computers and very occasionally on Windows machines.

Naturally, Windows computers are typically monitored using a different technique. Three of them, in fact. Sigh.

First, Windows computers support RPC. An administrator can tell if a Windows computer is healthy by connecting to it with a remote management console and looking at various data. The perfmon program, for example, can display graphs of Windows performance counters that measure available disk space, RAM and hundreds of other metrics.

Second, Windows computers support the Windows Management Instrumentation (WMI) protocol. WMI is a crude object-oriented mechanism that allows Windows monitoring and management software to query system metrics, set system parameters and invoke management functions. WMI, by the way, is based on a DMTF standard known to the rest of the world as CIM or WBEM. Forget about the “standards” part – Microsoft WMI is not interoperable with anyone else’s implementation. The Microsoft System Center Operations Manager (formerly MOM) folks had to implement their own WBEM code for Linux/UNIX in order to monitor these systems. The mechanism they implemented is actually the third monitoring technique that’s available on Windows, WS-Management or WS-Man as it’s frequently referred to.
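
From a scripting point of view, a WMI query is at least easy to issue. Here’s a sketch using the third-party Python wmi package (which wraps pywin32/COM) to pull free disk space, the kind of metric a monitoring console collects; passing a remote computer name plus credentials would do the same over DCOM/RPC:

    # WMI query sketch, assuming the third-party "wmi" package on a Windows box.
    import wmi

    def disk_report(computer=""):
        # An empty computer name queries the local machine; a hostname queries a remote one.
        conn = wmi.WMI(computer=computer)
        report = {}
        for disk in conn.Win32_LogicalDisk(DriveType=3):   # 3 = local fixed disks
            free_gb = int(disk.FreeSpace) / 2**30          # WMI returns uint64 values as strings
            size_gb = int(disk.Size) / 2**30
            report[disk.DeviceID] = (free_gb, size_gb)
        return report

    for drive, (free, size) in disk_report().items():
        print("%s %.1f GB free of %.1f GB" % (drive, free, size))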

WS-Man, like all of the WS-* protocols, is based on SOAP. A WS-Man aware monitoring program can read performance metrics and write configuration values by performing XML-based SOAP calls to a monitored device.

Although WS-Man seems like A Good Thing, especially since Microsoft is providing it on non-Windows platforms, I think it has several key flaws. First, WS-Man is based on both SOAP and WMI/CIM/WBEM. SOAP requires a considerable bit of glue to implement. In Windows, C# and .NET make it pretty easy. On UNIX, you can do it in C++ using Axis, for example; in Java using Sun’s JWSDP; or in Perl, Python or another SOAP-aware scripting language. Each of these has its flaws. The C++ approach is error prone. The .NET or Java approaches require a huge runtime memory footprint. The Perl/Python approach is typeless, requiring manual development of SOAP WSDL files instead of reflection-based synthesis. Beyond the SOAP issues, WMI/CIM/WBEM is simply butt-ugly (maybe even fugly). The technology had the misfortune of being designed before Java and C# came to fruition. As a result, its extension mechanism is just clunky.

Beyond SNMP, RPC, WMI and WS-Man, there are yet other solutions. Companies that make monitoring software (for example, Microsoft, IBM, HP, BMC, Computer Associates, and others) frequently have their own proprietary monitoring agents that use yet other protocols.

Given all of these unattractive alternatives, it is not surprising that companies don’t diligently monitor all of their systems. The ones who do this best usually end up using a mashup of various mechanisms: SNMP for network hardware, System Center/MOM for their Windows systems, some Nagios for agent-less monitoring, toss in some HP OpenView in one or two divisions and some home-grown stuff elsewhere.

What would Alan Turing do? Ack. I suppose WS-Man is better than the alternatives but I just can’t imagine Cisco adding all the necessary software to implement it.

Best Practices vs. Practical Reality

Sunday, June 22nd, 2008

I’m struck by the huge gap that I see between acknowledged best practices and what companies are actually practicing. More than once, I have researched a product direction and decided not to pursue it because it seemed to me that the market must already be saturated. Later, when talking to customers, I’m stunned to find that hardly any of them have bought or are using the product in question. How can a market support a dozen companies when none of them seem to have any market share?

If you are a regular reader of eWeek, Infoworld, CIO Insight and others, you might be led to believe that every IT department is:

  • Heavily using virtualization products
  • Using comprehensive network and application monitoring tools
  • Diligently practicing strong security techniques
  • Maintaining audit logs and performing correlation analysis on them
  • Faithfully practicing ITIL techniques

Now, I’m not in Sales, but I’ve been to a lot of sales calls. I’ve probably talked to 100 companies over the last year. Of these, I can count on zero fingers the number that are practicing all of the things mentioned above. On the other hand, the number of companies practicing none of the above is definitely non-zero!

In most cases, when I ask companies about the items I listed above, they sheepishly admit to their failings. They know they should be doing these things. In some cases, they’ve even already paid for the necessary software but have yet to deploy it. There is some tremendously successful shelfware in the industry.

What to make of all this?

  1. Don’t shy away from markets that seem to be crowded. There is still plenty of “whitespace” in the market where clever products and good companies can succeed.
  2. Don’t assume you can’t compete against software that’s been available for many years. I believe that there’s a lot of enterprise software that suffers from having been written 5 or 10 years ago using brittle programming techniques. A company with a strong engineering team can quickly develop a competing product using modern tools and techniques.
  3. Even good ideas and good products can take a good long time before they’re commonplace in IT. Certainly, most IT departments have good backup/restore infrastructure and good disaster recovery plans. It’s probably taken 10-20 years to make these pervasive practices.
  4. Read the journals, but talk to customers. The rags are way too preoccupied with what the top 5% of IT innovators are doing, and those innovators are hardly representative of the other 1,900 companies in the Fortune 2000.

There’s one other aspect of the problem that I’m still digesting. If there are a dozen competitors in a market, all doing the same thing, and none of them are succeeding, maybe the solution is to do something else. I’m reminded of a story I heard while taking a quality control course years ago. The story takes place in WWII and describes how an aircraft manufacturer decided which parts of its planes needed extra armor. It had an engineer study aircraft returning from combat. The company would look at the bullet holes on the planes and add armor where there were no bullet holes. Why? Obviously, the planes that got shot in those places were the ones that didn’t survive the battle.

This is obvious once it’s explained but it requires appreciating what artists call “negative space”. If you don’t know what this is, look for the arrow in the FedEx logo.