I’ve been reading a lot of references to the coming of age of cloud computing. The more I read, the more I am disappointed. In many cases, for example Amazon’s EC2, cloud computing seems like marketing-speak for “an easy way to rent virtual machines.” Amazon gives you some Web services that let you allocate virtual machines on the fly. Cool, but not exactly rocket science. Hadoop, or more specifically Google’s MapReduce programming model, is more along the lines of what I’d call cloud computing, but only for a narrow class of programming problems.
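To make the “narrow class” point concrete, here is a minimal single-machine sketch of the MapReduce programming model, using word counting as the example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own illustrative choices, not Hadoop’s or Google’s API, and a real framework would spread the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key. In a real framework this
    # step moves data between machines; here it is just a dictionary.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    # Each document can be mapped independently, and each key can be
    # reduced independently, which is what makes the model easy to scale.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the cloud", "the grid the cloud"]))
```

The model works when a problem splits into independent map calls followed by independent per-key reductions. Anything that needs shared mutable state or fine-grained coordination between steps does not fit this shape, which is why I call the class of problems narrow.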
The general idea with cloud computing is to be able to use a large network of computers to implement programs with large computing, storage or other resource requirements. As the needs of these programs change, the cloud should be able to adapt easily by adding resources on the fly.
The objectives of cloud computing are not that much different from the 1980s objectives of parallel or array processing computers or from the 1990s objectives of load balancing web applications. The restrictions that we see on would-be cloud computing “solutions” often just repeat the restrictions of those earlier technologies. Parallel computers were great at problems that could be easily parallelized. Unfortunately, there seems to be only one such problem: numerical simulation of fluid dynamics. Okay, I’m exaggerating, but certainly parallelization (especially automatic parallelization) has not proven applicable to a wide variety of domains. Load balancing, too, has proven to be harder than it seems. True, it’s easy enough to deploy extra Web servers to handle HTTP requests, but session state and database issues have proven more difficult. I would venture to guess that 95% of Internet web applications are still dependent on a single working database cluster and/or networked storage array.
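The session-state problem is easy to show in a toy example. This is a hedged sketch, not a model of any real load balancer: `WebServer`, its `handle` method, and `demo` are hypothetical names, and the “balancer” is nothing more than round-robin rotation over two in-memory objects.

```python
from itertools import cycle


class WebServer:
    def __init__(self, name):
        self.name = name
        # Session state lives only in this server's memory.
        self.sessions = {}

    def handle(self, user, item=None):
        # Look up (or create) the user's cart on THIS server only.
        cart = self.sessions.setdefault(user, [])
        if item is not None:
            cart.append(item)
        return list(cart)


def demo():
    servers = [WebServer("a"), WebServer("b")]
    balancer = cycle(servers)  # naive round-robin load balancing

    # First request goes to server "a" and stores an item in the cart.
    next(balancer).handle("alice", "book")

    # Second request goes to server "b", which has never seen alice.
    return next(balancer).handle("alice")


if __name__ == "__main__":
    print(demo())  # the cart comes back empty
```

Adding the second server was trivial, but the user’s cart vanished. The usual fixes (sticky sessions, or pushing the state into a shared database or cache) either limit how freely requests can be balanced or move the bottleneck to the shared store, which is the dependency I’m complaining about above.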
A couple of months ago, I got to sit in on a talk by Paul Maritz. I knew Paul at Microsoft and had talked to some of his developers at Pi Corporation. From Paul’s talk, it seems like Pi has done some interesting work with regard to distributed data. They’ve done some clever things, allowing data to exist in multiple places, while also providing local caches available off-line. Paul is a smart guy and I’m sure he’s part of the reason why Pi was acquired by EMC. Nevertheless, Pi seems to have focused primarily on storage and not on general cloud computing issues.
Microsoft, of course, is making noise about cloud computing. This is not surprising considering their late arrival to the party. Something is supposed to be announced by the end of the year, but even Ray Ozzie seems to be underpromising what it will be.
What would I like to see in a cloud computing architecture? I think it needs to accomplish several things:
- It needs to address storage, computation and bandwidth.
- It can restrict itself to specific application domains, but it has to be more general purpose than MapReduce. It should certainly cover HTTP/XML Web service applications and computationally intensive problems such as the SETI@home and Folding@home projects.
- It needs to be adaptive to changing needs and available resources.
- It should eliminate single points of failure.
A final thing that I’d like to see, though it’s not strictly a requirement, is that the system be self-organizing and not based on centralized control. I want BitTorrent, not Napster.
I have some ideas about how I’d personally design such a system, but nothing worth discussing just yet.