You have acquired a set of resources, but to be effective, you need to come up with a scheme for using them. Remember that you may not have the resources "forever," as is the case with dynamic resources, so you are best advised to make the most of them for as long as you have them. Even if the resources are dedicated, you still need a way to ensure high utilization. Resource managers are needed to achieve this goal. A typical resource manager uses some sort of scheduler to ensure proper usage of resources by increasing their utilization. Scheduling is the concept of sharing a scarce resource among users without starving any of them; at best, it gives each user the impression of having access to everything that resource has to offer. This poses a challenge when the number of users increases dramatically or the duration of the jobs varies greatly. What makes this challenge even greater is that most scheduling problems are NP-complete, with only a very limited number of scenarios known to be solvable in polynomial time (that is, to fall in class P).
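One classic way to share a single CPU without starving anyone is round-robin time slicing. The Python sketch below only illustrates that idea and is not taken from any particular system; the job names, the `quantum` parameter, and the `round_robin` function are hypothetical.

```python
from collections import deque

def round_robin(jobs, quantum):
    """Share one CPU among jobs in fixed time slices.

    jobs: dict mapping job name -> remaining work (time units).
    Returns the sequence of granted slices and each job's finish time.
    """
    queue = deque(jobs.items())   # ready queue, in arrival order
    clock = 0
    timeline = []                 # (job, slice length) for every grant
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        clock += run
        timeline.append((name, run))
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back of the line, so no job starves
        else:
            finished[name] = clock
    return timeline, finished
```

With jobs `{"a": 5, "b": 2, "c": 3}` and `quantum=2`, every job gets a turn before any job gets a second one, so `b` finishes at time 4, `c` at 9, and `a` at 10.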
Single-criterion scheduling problems are those in which the user is interested in maximizing or minimizing only one thing, or criterion (for example, minimizing the flow time or the completion time). Many scenarios, machine shop or otherwise, require more than one criterion to be optimized. For example, on a multiprocessing machine, you want to minimize startup time and at the same time minimize the completion time of all the tasks. At times these two criteria conflict; for instance, you might need to suspend a task, thus delaying its completion, to start a newly arrived task. The point is that "sacrifices" must be made, and that is where heuristic algorithms come in: they aim to minimize the overall sacrifice needed to bring every criterion as close to optimal as possible. This does not always work, but considering the problem domain, it is a very good attempt at solving problems that cannot be solved exactly in reasonable time. As you might expect, scheduling shares a number of ideas with optimization theory.
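The tension between criteria shows up even on a tiny single-machine example. The Python sketch below assumes each job has a processing time and a due date; due dates are not mentioned above and are added here only to provide a second criterion (maximum tardiness). Minimizing total completion time alone is solved by the shortest-processing-time (SPT) rule. Combining the two criteria is done with a weighted sum searched exhaustively, a swapped-in technique chosen for clarity rather than a heuristic; all function names are hypothetical.

```python
from itertools import permutations

def completion_times(order, jobs):
    """Completion time of each job when run back to back on one machine."""
    clock, done = 0, {}
    for name in order:
        clock += jobs[name][0]          # add this job's processing time
        done[name] = clock
    return done

def criteria(order, jobs):
    """Return (total completion time, maximum tardiness) for a sequence."""
    done = completion_times(order, jobs)
    total = sum(done.values())
    max_tardy = max(max(0, done[n] - jobs[n][1]) for n in order)
    return total, max_tardy

def spt_order(jobs):
    """Shortest processing time first: optimal for total completion time alone."""
    return tuple(sorted(jobs, key=lambda n: jobs[n][0]))

def best_weighted(jobs, w_total, w_tardy):
    """Minimize w_total * total + w_tardy * max_tardy by trying every order.

    There are n! orders, so this is only feasible for tiny n; that blow-up
    is why practical schedulers fall back on heuristics.
    """
    def score(order):
        total, tardy = criteria(order, jobs)
        return w_total * total + w_tardy * tardy
    return min(permutations(jobs), key=score)
```

For jobs `{"a": (3, 3), "b": (1, 10), "c": (2, 10)}` (processing time, due date), weights `(1, 0)` reproduce the SPT order `("b", "c", "a")`, a heavy tardiness weight such as `(1, 10)` puts the urgent job first as `("a", "b", "c")`, and equal weights pick `("b", "a", "c")`, a compromise that is optimal for neither criterion alone.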
Quality of Service (QoS) has a special meaning in Grid computing because it no longer applies only to network resources. Compute, data, and network resources need to be managed together, and there needs to be a mechanism that provides a quantifiable way of specifying QoS across all three domains. Scheduling systems thus need to take QoS guarantees into account when scheduling tasks across resources and administrative domains. QoS-aware scheduling of tasks and data is further complicated in globally distributed and/or dense systems, where scheduling itself is harder and meeting QoS guarantees becomes even more complex.
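A quantifiable, multi-domain QoS specification might look something like the following Python sketch. Every class name, field name, unit, and threshold here is a hypothetical illustration; real Grid middleware defines its own schemas. The idea is that a single request carries compute, data, and network requirements, and the scheduler checks all three before placing a task.

```python
from dataclasses import dataclass, field

@dataclass
class QoSRequest:
    """Hypothetical QoS requirements spanning compute, data, and network."""
    min_cpus: int                                # compute
    min_free_mem_mb: int                         # compute
    datasets: set = field(default_factory=set)   # data that must be reachable
    min_bandwidth_mbps: float = 0.0              # network

@dataclass
class ResourceOffer:
    """Hypothetical description of what one resource can provide."""
    name: str
    cpus: int
    free_mem_mb: int
    datasets: set
    bandwidth_mbps: float

def violations(req, offer):
    """List the QoS domains an offer fails to satisfy (empty list means OK)."""
    failed = []
    if offer.cpus < req.min_cpus or offer.free_mem_mb < req.min_free_mem_mb:
        failed.append("compute")
    if not req.datasets <= offer.datasets:       # every required dataset present?
        failed.append("data")
    if offer.bandwidth_mbps < req.min_bandwidth_mbps:
        failed.append("network")
    return failed

def eligible(req, offers):
    """Names of the offers that meet the request in all three domains."""
    return [o.name for o in offers if not violations(req, o)]
```

Reporting which domain failed, rather than a bare yes/no, lets a scheduler decide whether to wait for a better resource or, for example, stage the missing data first.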
Think of an operating system and how it schedules threads or processes on the CPU. As the number of CPUs increases, the problem becomes more difficult, but the concept is still the same. There are a number of different scheduling algorithms, but I will not cover them in this article. The main focus here is to break down a resource manager into its core components and talk about how these components work together to achieve a single goal: high resource utilization.
Resource Manager Components
Conceptually speaking, the resource manager is very simple:
- Queue incoming tasks
- Keep a record of available resources
- Match resources with the incoming tasks (scheduler)
- Queue results
This is not to say that designing or writing a resource manager is easy, but conceptually it is a simple enough design that you can relate to it. Figure 1 depicts this architecture.
Figure 1: Anatomy of a Resource Manager
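To make the four components concrete, here is a minimal, single-threaded Python sketch. The class name, the memory-based matching rule, and the split between `schedule` and `complete` are assumptions made for illustration; a real resource manager would be concurrent and distributed.

```python
from collections import deque

class ResourceManager:
    """Toy resource manager built from the four components listed above."""

    def __init__(self):
        self.tasks = deque()     # component 1: queue of incoming tasks
        self.available = {}      # component 2: idle resource -> free memory (MB)
        self.busy = {}           # resource -> (task_id, free memory) while running
        self.results = deque()   # component 4: queue of finished results

    def submit(self, task_id, mem_mb):
        self.tasks.append((task_id, mem_mb))

    def add_resource(self, name, free_mem_mb):
        self.available[name] = free_mem_mb

    def schedule(self):
        """Component 3: match queued tasks to idle resources with enough memory."""
        assigned, waiting = [], deque()
        while self.tasks:
            task_id, mem_mb = self.tasks.popleft()
            fit = next((r for r, m in self.available.items() if m >= mem_mb), None)
            if fit is None:
                waiting.append((task_id, mem_mb))   # no idle resource fits yet
                continue
            self.busy[fit] = (task_id, self.available.pop(fit))
            assigned.append((task_id, fit))
        self.tasks = waiting
        return assigned

    def complete(self, resource, result):
        """Queue a finished task's result and return its resource to the table."""
        task_id, mem = self.busy.pop(resource)
        self.results.append((task_id, result))
        self.available[resource] = mem
```

A task that no idle resource can hold stays queued until `complete` returns a resource to the table. Note that this first-fit pass lets a later, smaller task start ahead of an earlier, larger one that is still waiting.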
There are a number of ways to realize this architecture, but the one thing to keep in mind is that network queuing theory plays a major role. If tasks arrive faster than your processing engine can take them off the queue, the client queue will back up, and once it is full you will start to lose tasks. A router placed in a network with large amounts of data transfer behaves the same way. In a resource manager, congestion control is implicit, because a resource is ready for, and requests, the next task only after it has completed the current one. This makes the environment easier to reason about: if you see a backlog of tasks waiting to be processed, that is a clear indication that you need to add more resources to handle the heavy load of incoming tasks.
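The backlog-and-loss behavior can be reproduced with a small discrete-time simulation. This Python sketch assumes unit-length tasks, a fixed arrival rate, and a bounded queue; `simulate` and its parameters are hypothetical. Workers pull a task only when idle, which models the implicit congestion control described above.

```python
from collections import deque

def simulate(arrivals_per_tick, workers, capacity, ticks):
    """Bounded task queue served by pull-based workers, in discrete time.

    Each tick, new tasks arrive (and are dropped if the queue is full), then
    every idle worker pulls one task. Each task takes exactly one tick, so
    all workers are idle again at the start of the next tick.
    Returns (processed, dropped, backlog at the end).
    """
    queue = deque()
    processed = dropped = 0
    for t in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(queue) < capacity:
                queue.append(t)          # remember the arrival tick
            else:
                dropped += 1             # queue full: the task is lost
        # pull model: a worker takes a task only after finishing its last one
        for _ in range(min(workers, len(queue))):
            queue.popleft()
            processed += 1
    return processed, dropped, len(queue)
```

With 3 arrivals per tick, 2 workers, and a queue capacity of 5, `simulate(3, 2, 5, 10)` returns `(20, 7, 3)`: the queue fills up and loses one task per tick from the fourth tick on. Adding a third worker, `simulate(3, 3, 5, 10)` returns `(30, 0, 0)`.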
The goal of this article is not to build a resource manager, but rather to gain a clearer understanding of how one actually works and what its main components are. For now, focus on the overall flow; you will delve into the details in the subsequent sections.
The flow is something like the following:
- Resources log on to the Grid resource manager.
- The resource sends basic information to the resource manager, such as OS type, amount of free memory, number of CPUs, and other parameters that depend on the resource manager involved.
- Data and any updates are synchronized between the resource manager and the resource.
- The resource goes into a waiting queue, ready to be assigned a task.
- The resource manager updates the table of available resources with the new resource.
- The scheduling engine assigns a task to the resource if and when a new task is available.
- The resource gets the task and the data, loads the appropriate service, and executes the task.
- The task result is sent back to the client.
- The resource is ready for another task.
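The flow above can be read as a state machine for each resource. The Python sketch below encodes the steps as states and legal transitions; the state names and the `run` helper are hypothetical and simply mirror the list, with the waiting-queue and table-update steps folded into one `WAITING` state, and task assignment and execution folded into one `EXECUTING` state.

```python
from enum import Enum, auto

class State(Enum):
    OFFLINE = auto()
    LOGGED_ON = auto()     # resource has logged on to the resource manager
    REGISTERED = auto()    # basic info (OS type, free memory, CPUs) sent
    SYNCED = auto()        # data and updates synchronized
    WAITING = auto()       # in the waiting queue and the available-resource table
    EXECUTING = auto()     # task assigned, service loaded, task running

# Leaving EXECUTING means the result went back to the client and the
# resource is ready for another task, so it returns to WAITING.
TRANSITIONS = {
    State.OFFLINE: {State.LOGGED_ON},
    State.LOGGED_ON: {State.REGISTERED},
    State.REGISTERED: {State.SYNCED},
    State.SYNCED: {State.WAITING},
    State.WAITING: {State.EXECUTING},
    State.EXECUTING: {State.WAITING},
}

def run(events):
    """Replay a sequence of states, raising ValueError on an illegal jump."""
    state = State.OFFLINE
    for nxt in events:
        if nxt not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
        state = nxt
    return state
```

Making the transitions explicit lets a resource manager reject, for example, a task assignment to a resource that has logged on but not yet synchronized its data.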