Data is huge! Literally speaking, that is. Management of data is complex, for a single application or for the enterprise. It's interesting to know that the approach to manage large chunks of data has changed very little over the years; technology has improved, but the method by which you apply the technology is still the same. This article will focus on data management; not the specifics, but a more general overview of available options.
There are two basic scenarios:
- The data is larger than the available memory.
- The data is smaller than the available memory.
It comes down to one or the other, but realistically, the second is the only viable option. If the data is larger than the available memory, you will need to break the large data into smaller chunks that you can process one at a time. You can further classify this scenario into input and output data sizes:
- Input data is small, and output data is small
- Input data is small, and output data is large
- Input data is large, and output data is small
- Input data is large, and output data is large
Figure 1 depicts your options (the plus sign is an "or"). You are showing part of the ultimate solution in the picture in that the "small" data is saved and retrieved locally, and "large" data is in some external location.
Figure 1: The Depiction of Data
The assumption you are making is that the data is either too large or too small to fit in the available memory of the process[or]. It does not matter if you have 4MB of RAM or 16GB of RAM; the assumption is that the data that you want to crunch is larger than the space you have available.
The interesting part is that this problem never goes away! Years from now, when a server comes with 4TB of RAM, you will likely be struggling to find out how to crunch your data, which is of the order of 40TB, for example. The more interesting part is that you will likely solve this problem using the same technique you utilize today—paging!
What is paging? Paging is basically the swapping of data between main memory and secondary storage; to get the data you need into a faster media (in other words, local memory and eventually the processor cache), and swap "old" data from main memory back to the disk to create space for the "new" data. The cleverness of any design is and will be how the "old" and "new" data pages are accessed, copied, and restored onto disk and moved into the processor cache (the fastest accessible large memory) to be processed. So, the basic idea is simple, but implementation is very challenging; that's why new and clever algorithms are being introduced by processor vendors, memory vendors, drive vendors, operating system vendors, and so forth. Paging is an effective method of managing data, either logically or physically—logically as in a database caching the latest data retrieved even if the data being cached is larger than the available memory or physically when a processor interfaces with local RAM and the storage drives. Obviously, in normal circumstances you will use a heuristic algorithm to find out what to page-in and page-out.
Basically, you want to remove the page that you no longer need. The difficulty arises when you try to find that section of memory which you no longer need. The problem becomes even more challenging when you are trying to predict the next section of data that you will need to avoid waiting for data.
The performance of any system drops to a screeching halt if and when you have to wait for data. Local memory is much faster than any attached storage device. The moment you have go to a storage device to retrieve data, you might as well use an i386 processor! Regardless, there are many instances when you do have to wait, and ultimately determine or guess what the "old" and "new" data pages are.
Or Not to Wait
If there was a way to look into the future and know when and what data piece you might need, you could account for and have the data ready exactly when it is needed. Wouldn't that be nice? Well, I suggest that you let the physicists at CERN figure out the time travel and "looking into the future" thing, and we mere mortals can focus on simply controlling the future.
Wouldn't it be nice to have a system that could think one step ahead to optimize and prepare itself for what is about to come? You are not looking to control every aspect and not control too much into the "future," but only our next step or at most, a couple of steps. Why? If you look too much into the future and predict, you go back to the heuristic scenario that U spoke about. If you don't, you cannot optimize. The idea is hit a note somewhere in between one that is in tune with our application, users' requests, data profile, and job type!
With any prediction unit, you run the risk of stale data or stale control information (in other words, a job was canceled, but you got the cancel message late). So, your system must compensate for that. Your system can control its next step (n+1), because it inherently knows what that next step is ("know thyself"). Your system, however, cannot control its n+2, n+3, ... too much control could lead to too many wasted resources.
[That Is] the Question
Predicting the next task for a given job is interesting, but trying to expand this notion to a number of jobs is challenging. The paging logic becomes complex when dealing with a multi-job environment, but benefits can only be realized when dealing with a multi-job environment. The logic in not applicable in the micro-sense (a single job), but I will show that it is in fact feasible in the macro-sense (a number of jobs).
All resource managers or schedulers work in the same way: A task comes that is part of an overall job or session; the scheduler looks at the available set of resources and compares them with the requirement of the task[s] queued; the task is assigned to a resource. Done!
Now, assume that you have a number of jobs with each having a number of tasks. Each task has three stages:
- S: Task scheduled
- D: Data for task received by resource
- E: Task starts execution