Whatever it is that you are trying to do, one thing is certain: the need for more compute power is always increasing. You want to do more in less time, and to be more efficient, cheaper, and more reliable. If you have 'x' amount of compute power today, you want '2x' tomorrow, and so on. The more you do, the more you realize what else you want to do; thus, you crave more. This cycle forces you to come up with innovative ways of achieving your goals, whether through multi-processor systems, multiple single-processor systems, or the many other solutions that have surfaced over the past few decades.
I want to briefly and rather brutally cover the very basics of computing. To decrease the overall runtime, one needs to decrease one or more of the following components:
- Communication time
  - This is the time that it takes to get the data and the command from point A to point B. Remember, if you are telling a processor to do something, you first need to tell it what you want it to do and give it the data that the command runs against. The time that it takes to get these two pieces of information to the processor is crucial to the overall runtime. This is especially true of distributed systems, where the components may be dispersed, perhaps even globally.
- Startup time
  - This is the time that it takes for the 'manager' to assign a CPU to do whatever it is that you are asking it to do. As you can imagine, this is probably one of the most difficult parts of the system, and it is where the Operating System or the Resource Manager is usually the bottleneck. Imagine that you have thousands of requests and only a limited number of resources at your disposal: the Resource Manager (also called the Scheduler) needs to handle all of these requests and ensure that they complete on those limited resources.
  - Consider another scenario: if you have only four jobs and four CPUs, do you need a resource manager? The answer is no! The resource manager or scheduler is required only if you are trying to 'timeshare' your scarce resources among a number of users. You see the scenario I just described in embedded realtime systems, where you need to know when a task will run and exactly for how long.
- Processing time
  - This is where Moore's law comes into play. This is the classic processor problem, where faster processors are sought to decrease processing time.
  - Distributed processing is another way of looking at this problem. Instead of one very powerful processor, you can use a large number of small processors. There are a number of problems where the sheer number of operations is what slows you down, not necessarily the time that a single operation takes. Processors are, generally speaking, fast enough these days that much of the focus is being diverted to multi-processing systems.
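To make the three components concrete, here is a minimal Python sketch of how they might add up. The function name `estimated_runtime` and its parameters are hypothetical, and the model rests on simplifying assumptions of my own: every job costs the same, jobs run in "waves" of at most one per CPU, and each job pays its own communication and startup overhead. Treat it as a back-of-the-envelope aid, not a model of any real scheduler.

```python
import math


def estimated_runtime(jobs, cpus, comm_s, startup_s, work_s):
    """Rough wall-clock estimate, in seconds, for a batch of identical jobs.

    Assumptions (mine, not the article's): every job costs the same,
    jobs run in waves of at most `cpus` at a time, and each job pays
    its own communication and startup overhead.
    """
    if jobs < 0 or cpus < 1:
        raise ValueError("jobs must be >= 0 and cpus must be >= 1")
    waves = math.ceil(jobs / cpus)  # rounds needed to get through all jobs
    return waves * (comm_s + startup_s + work_s)


if __name__ == "__main__":
    base = dict(comm_s=0.5, startup_s=0.25, work_s=2.0)
    # Four jobs on four CPUs: one wave, nothing to timeshare.
    print(estimated_runtime(4, 4, **base))      # 2.75
    # 1000 jobs on four CPUs: 250 waves.
    print(estimated_runtime(1000, 4, **base))   # 687.5
    # A processor twice as fast shrinks only the processing term.
    print(estimated_runtime(1000, 4, comm_s=0.5, startup_s=0.25, work_s=1.0))  # 437.5
    # Twice as many processors halves the number of waves instead.
    print(estimated_runtime(1000, 8, **base))   # 343.75
```

With these made-up numbers, doubling the processor count helps more than doubling processor speed, because the faster processor leaves the communication and startup overhead untouched. Real workloads will differ, since overheads rarely stay constant as the system grows.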
As you can see, more than just the processing time determines how long a task will run. Many technologies target one or more of these components, and I have included an example below:
Table 1: The Three Classes of Problem Realization
- Networking and Middleware Problem
- Grid Scheduling and OS Problem
- Hardware and Processor Problem
There are many other vendors, and many of them target more than one class of problem, but I have shown only the cream of the crop. The chances are that you will need to implement a solution that covers this whole spectrum, and that is not easy. I am talking about grand problems, and the solutions to these problems are grander than the problems themselves. One thing is certain: more unconventional methods are emerging. Solutions that were not possible a few years back, or that never made it out of the research labs, are resurfacing thanks to advances in technology, open source software, predefined interfaces, and more.
... and All of That
Depending on whom you speak with, you will get a different explanation for one or even all of the "key phrases" that you read in this article. I don't pretend to clear up all the confusion, but rather to give you a perspective on what each one might be perceived as, and why. Clusters are what I would like to cover first, mainly because clusters in the most classical sense have been around for many years. If you have been working for a number of years, the names Tandem and Stratus probably ring a bell. These two computer manufacturers (Tandem is now part of HP as the NonStop line, and Stratus is still in business) truly understood the concept of cluster computing. Each of the server classes that these two companies made was composed of a number (16 or more) of tightly coupled computers interconnected by a communication backbone.
When Intel started supporting multiprocessing systems with the introduction of the Pentium Pro, one needed to "cluster" four processors into one cluster and have multiple clusters connected via a bridge. Today, when you speak of clusters, you mostly refer to a large number of homogeneous computers interconnected in close proximity. If you have a datacenter filled with Dell blades and they all run the same or a very similar Operating System version, you have built yourself a cluster. You might think that this explanation is very narrow and restrictive, but think about it for a second: is it really? If you are building a new datacenter, wouldn't you rather have one Operating System to worry about? Wouldn't you rather have a single vendor to deal with? Whether you were aiming for a cluster or a cluster is simply the end result, this is what you have.
Don't confuse, as often happens, the meaning of a cluster with that of a grid. This is where things start getting very interesting, because the meaning of these two concepts gets a little clouded. As I mentioned, a cluster is normally thought of as a homogeneous environment with all of its nodes in close proximity. Depending on whom you ask, one of these two features is more important than the other. For the research community, mostly, as long as I am talking about nodes in very close proximity (in other words, a datacenter), I am talking about a cluster. For the commercial world, if I am talking about anything other than the true and classic definition of a cluster, I am talking about a grid. What is the reason for this? I believe the notion of a cluster is associated with reliability and resiliency; after all, the two companies (Tandem and Stratus) that sold clustered machines had very high reliability built in. When you walk away from that level of tight coupling and instead interconnect off-the-shelf blades, you are unable to ensure 99.999% reliability (five nines).
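To put "five nines" in perspective, the following Python sketch converts an availability figure into expected downtime per year, and shows how quickly availability erodes when a service depends on many loosely coupled nodes. Both function names are hypothetical. The second function assumes that nodes fail independently and that the service needs every node up, with no redundancy or failover; that is a deliberately pessimistic simplification of mine, not a claim about any particular product.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def all_nodes_up(node_availability, nodes):
    """Probability that every one of `nodes` machines is up at once.

    Assumption: failures are independent and the service needs all
    nodes, so there is no redundancy or failover.
    """
    if nodes < 0:
        raise ValueError("nodes must be >= 0")
    return node_availability ** nodes


if __name__ == "__main__":
    print(round(downtime_minutes_per_year(0.99999), 2))  # 5.26 (five nines)
    print(round(downtime_minutes_per_year(0.999), 1))    # 525.6 (three nines)
    # 100 blades at three nines each, all required to be up together:
    print(round(all_nodes_up(0.999, 100), 3))            # 0.905
```

Five nines leaves a budget of a little over five minutes of downtime per year. Under the stated assumptions, a hundred blades that are each individually at three nines are all up together only about 90% of the time, which is why commodity hardware needs redundancy to approach the figures the classic clustered machines advertised.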