It is no longer enough to simply buy servers. Why? Not because you can never have enough servers, or blades, or workstations; but because you can never have enough computing power! If you don't agree with this statement, you should stop reading the rest of this article, and go and do something more valuable with your time!
Managing compute resources has become far more challenging as the sheer number of resources increases and the resources themselves become less transparent. Resources are growing more complex as accelerator technologies such as GPUs and FPGAs become more prevalent, and multi-core processors such as the IBM Cell are throwing developers a curveball with the concept of the heterogeneous multi-core processor.
From the business side, requirements keep coming in and demands are becoming more complex: new Quality of Service (QoS) requirements, new response-time requirements, and so forth. How did we get here? Where are we going from here, and how can we get there? One thing is for sure: there is no "magic." It's a process, and we have to go through every step of this [5-step] process in order to come out on top. Assuming that "on top" of your game is where you want to be!
The Five Stages
What are these five stages that I am referring to?
- Chaos: Bunch of servers
- Organized Chaos: Management of servers; Grid Computing
- Uniformity: Focusing on performance; Cluster Computing
- Understanding: Focusing on scalability and business needs; Utility Computing
- Order: Seamlessness and adaptation; Cloud and Adaptive Computing
Figure 1 illustrates these five stages, and the rest of this article explains each of them in turn.
Figure 1: The five stages of datacenter management
First: There Was Chaos
The first goal for the IT part of any organization is "availability." Most of the time, this translates directly into buying two of everything and having a backup for every major system. Not much is done in terms of management or automation. When a server fails, the backup takes over. The problem with this approach is that one can achieve at most 50% utilization (one system at 100% utilization, one system waiting for a failure at 0% utilization). The other flaw with this "server farm" approach is that the IT organization is mostly playing "catch up"; in other words, waiting for something to go wrong and then trying to fix the problem. This approach is reactive, not proactive. What makes sense here is to be more proactive and add a layer of management to the infrastructure.
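The 50% ceiling follows from simple arithmetic. Here is a minimal sketch, assuming every primary server has one idle standby and that utilization is averaged across all purchased machines. The function name and the load figures are illustrative, not taken from any particular tool.

```python
def farm_utilization(primary_loads, standbys_per_primary=1):
    """Average utilization across every purchased server.

    primary_loads: load of each primary server, from 0.0 (idle) to 1.0 (full).
    standbys_per_primary: idle backup machines bought per primary (assumed 1).
    """
    if not primary_loads:
        raise ValueError("at least one primary server is required")
    # Standbys contribute zero load but still count as purchased capacity.
    total_servers = len(primary_loads) * (1 + standbys_per_primary)
    return sum(primary_loads) / total_servers


# Best case: every primary is fully loaded, every standby sits idle.
best_case = farm_utilization([1.0, 1.0, 1.0])

# A more typical (hypothetical) mix of partially loaded primaries.
typical = farm_utilization([0.6, 0.4, 0.8])

print(f"best case: {best_case:.0%}, typical: {typical:.0%}")
```

Even in the best case, half of the purchased capacity does nothing; with partially loaded primaries the figure drops further.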
Grid: Chaos. Organized.
Why is Grid chaotic? The whole purpose of Grid is to increase utilization of the resources. This is achieved by effective management of the infrastructure. This comes at a great cost: availability! For an organization to choose Grid over a Server Farm, it must evaluate the trade-off between availability and utilization.
A Grid infrastructure promises higher utilization of resources. These "resources" are the Disaster Recovery (DR) resources purchased in case of a primary node failure. The Grid starts utilizing those resources (albeit to achieve a better Quality of Service), but this means those resources are no longer immediately available in case of failure. The key word here is "immediately." Obviously, a proper Grid infrastructure can increase the utilization of the existing infrastructure, but here we are focusing on the utilization of resources in a datacenter, and of a specific resource in particular.
Grid does add a layer of provisioning and management to the infrastructure. The infrastructure is more proactive in that it has some capability to seek proper resources for a given job. In heterogeneous environments, this "seek and you shall find" mentality is very useful. This model can result in lower network efficiency, but having more resources available at your fingertips will increase the quality of service (QoS).
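The "seek and you shall find" idea can be sketched as a simple matching step. This is only an illustration under stated assumptions: the `Resource` and `Job` types, their fields, and the "prefer lowest latency" rule are hypothetical and do not describe any specific Grid product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    name: str
    arch: str          # e.g. "x86_64" or "cell" (hypothetical labels)
    free_cores: int
    latency_ms: float  # assumed network latency from the job submitter


@dataclass(frozen=True)
class Job:
    name: str
    arch: str
    cores: int


def find_resource(job: Job, pool: list[Resource]) -> Optional[Resource]:
    """Return the lowest-latency resource that satisfies the job, or None."""
    candidates = [
        r for r in pool
        if r.arch == job.arch and r.free_cores >= job.cores
    ]
    # Preferring low latency is an assumption made for this sketch.
    return min(candidates, key=lambda r: r.latency_ms, default=None)


pool = [
    Resource("desktop-17", "x86_64", free_cores=2, latency_ms=40.0),
    Resource("blade-03", "x86_64", free_cores=8, latency_ms=5.0),
    Resource("cell-01", "cell", free_cores=8, latency_ms=12.0),
]

match = find_resource(Job("risk-calc", "x86_64", cores=4), pool)
print(match.name if match else "no suitable resource")
```

The wider the pool, the more likely a match exists, but the candidates may sit far away on the network, which is the efficiency cost mentioned above.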
Wants and wishes turn into requirements after a Grid solution has been deployed. These requirements usually involve tighter response times and better QoS definitions. It turns out that although Grid's ability to tie together heterogeneous environments and make the best use of underlying resources is desirable, its inability to use the network efficiently becomes problematic when response-time requirements get tighter.
A cluster environment is usually composed of homogeneous machines in close proximity, such as in a datacenter. This tighter coupling of resources allows for a better response time, and the close proximity of resources allows for a reasonable recovery in case of failure. You are, however, running the risk of lower utilization because you are excluding some of your resources from your cluster for performance reasons. The obvious next step here is to mix the two environments and make one large compute backbone for the enterprise.
Utilitarianism: Key to Understanding
Sharing is key! The ability to share resources of different types among users is a powerful and yet mostly underutilized concept. As discussed in the previous section, the goal here is to combine all resources: slow and fast, desktops and blades, and so on. A concept known mostly as the single-system image (SSI) allows both environments (Grid and Cluster) to work side by side. The resources may be geographically dispersed, yet they remain accessible and can still meet the QoS requirements.
This is a difficult stage to reach because there are two ends of the spectrum to satisfy: on one end, you need to meet tight response-time requirements and stringent QoS requirements; on the other end, you need to increase utilization of your resources without becoming heterogeneous and ad hoc in resource acquisition.
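One way to picture the balancing act is a placement rule that sends deadline-sensitive work to the tightly coupled cluster and everything else to the wider pool. The sketch below is hypothetical: the `place` function, the 100 ms threshold, and the pool labels are assumptions chosen for illustration, not a description of any real scheduler.

```python
def place(deadline_ms, cluster_free, grid_free, tight_threshold_ms=100):
    """Choose where a job should run.

    deadline_ms: response time the job must meet.
    cluster_free / grid_free: free slots in each pool.
    tight_threshold_ms: assumed cutoff below which a deadline counts as tight.
    """
    if deadline_ms <= tight_threshold_ms:
        # Tight deadlines need the low-latency cluster; the grid cannot help.
        return "cluster" if cluster_free > 0 else "reject"
    # Relaxed jobs prefer the grid, which keeps otherwise idle machines busy
    # and leaves cluster capacity for deadline-sensitive work.
    if grid_free > 0:
        return "grid"
    return "cluster" if cluster_free > 0 else "queue"


print(place(50, cluster_free=4, grid_free=20))    # tight deadline
print(place(5000, cluster_free=4, grid_free=20))  # relaxed deadline
```

The design choice worth noting is the asymmetry: relaxed jobs may spill over into the cluster, but tight jobs never spill over into the grid.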