By Taras Bachynskyy, Data Architect, SoftServe Inc.
Recently I was reviewing several research reports by leading technology research companies such as Gartner, 451 Research, TDWI and others. Some reports were really valuable, but overall, none of them satisfied me completely, so I decided to classify database technologies in terms of their application and create a single-view representation for all technology groups such as RDBMS, Key-Value stores, Graphs, etc.
The main goal was to create a classification to help software developers evaluate a database technology and show where a particular technology could be applied. It was important for me to build a visual representation instead of plain text explanation because visualized information is perceived much easier.
Unfortunately, it is practically impossible to show multiple perspectives on the same figure. Usually you can show up to 3-4 perspectives in the plain diagram. Therefore, it would be better to consider the perspectives seen from different points of view as far as there are many measurements and comparison criteria. In the first approximation, I decided to consider the “most popular” perspectives and design a high-level classification based on my experience and published research.
First, I selected most common database technologies and introduced the term “technology group”, because a technology itself could be interpreted as a vendor specific product or a physical implementation. For instance, Microsoft SQL Server and Oracle Database differ from each other as products but belong to the same technology group, RDBMS. And it is very important to separate concept from physical implementation. The following groups have been selected for evaluation:
- RDBMS (associated with ORDBMS, OLTP, etc.)
- Big Table
- Hadoop Ecosystem
Initially, an “In-Memory” technology group was on the list as well; in reality “In-Memory” is just a physical implementation or property, which can belong to any technology group. For example, “key-value” + “in-memory” = “memcache”.
“NewSQL” (please do not confuse with "NoSQL") is not on the list of technology groups for the same reason as “In-memory”. NewSQL combines several technologies and the main goal is to beat the CAP theorem. For this purpose, various techniques have been applied: in-memory, distributed queries and other. Looking ahead, if you apply a “NewSQL” rectangle in the diagram, it will cover many sections just because NewSQL itself is a combination of technologies.
“Column Family + Key-Value” and “Big Table” are combinations of technologies as well but each of them are technologies themselves and have different purposes. They have their own place in the market and that’s why they are represented as separate rectangles in the generalized view.
There are a lot of perspectives and comparison criteria that can be applied to databases technologies, such as capacity (in our case its data volume supported by particular technology), supported level of data consistency (BASE, ACID, etc.), availability, scalability, nature of data (data volumes, structured or non-structured data), use cases (transactional systems, analytical systems, etc.) and others. It’s too complicated to visualize all of them in one diagram, so I decide to take the most common measures:
- horizontal axis - Data Volumes
- vertical axis - Use Cases
The use case map diagram reflects use cases and data structure. Each color represents a group of use cases.
Again, I would like to emphasize that this is conceptual comparison rather than vendor-specific implementation. Let’s say, Microsoft Analysis Services has built-in algorithms for predictive analysis, but those algorithms are just an add-on to the OLAP technology (for example, there are third-party vendors such as R that have functionality for advanced analysis and supports integration with Vertica, etc.).
OLAP Cubes vs. Column-Oriented Storages
The main difference between column-oriented DBs and Cubes is that analytical systems based on сolumn-oriented storages are more agile; they have much better ad-hoc capabilities and "time to market". In order to add new slice to the cube, you will need to undergo the full development life cycle: ETL, cube design and semantic model changes.
- Column-oriented stores dominate in the field of transaction/low-level analysis in comparison with Cubes (leaf level calculations in case of MOLAP);
- Cubes have a nice functionality for hierarchy-based calculations and analysis. I would say this is MDX vs SQL.
- Column-oriented storages are highly scalable. For example, HP Vertica could operate with 10х TB of data, Amazon Redshift supports PB-scale solutions
Key-Value vs. Document-Oriented
On the diagram above, Key-Value technology is slightly intersecting with the “Analysis and Processing” section because this is a characteristic of a pure key-value technology. Read by key is the main feature of Key-Value storage. Of course, Map-Reduce is also available, but processing of the “value” part is usually performed on the client side. Both key-value and document-oriented are mainly being used in web solutions.
Document-oriented DBs have a defined structure (they are seamless, but “scheme” is the main word in this case). That means that it gives more ability to query and process data, you are not limited to query by key. And in case of MongoDB, you are not limited by Map-Reduce.
Key-Value approach is the best choice if you need to query by key only. It gives fast read-write operations like put-get requests with simple data. One of the possible applications of key/value is a genome alignment since the genome has very simple structure: position and simple values. In most cases you just need to read value by position in the genome.
Brown vs. Yellow
There is a reason why I named the section "Brown vs Yellow". On the diagram above both Column-oriented databases and Hadoop Ecosystem have a fairly large area of overlapping - this is because they both can operate with quite large data volumes. Of course, Hadoop is a winner in the field of data volumes, but the main difference between them is their application/use cases and data structure.
For example, if you are going to operate with unstructured data, Column-oriented databases will not help you. And there is no reason to use Hadoop if your data is structured (of course if data volumes are not extremely large, and I’m talking about PB-scale databases). It is important to understand what the source is for your data.
It is really important to understand that analytics solutions based on Column-oriented technologies are more self-service in comparison with Hadoop (I’m talking about semantic models provided by BI vendors such as MicroStrategy, Tableau and others). A semantic model makes business people more independent from IT. Of course in practice it is not so easy, but a properly planned and designed solution could be self-service more or less. Hadoop-based solutions require more software development and usually it is more complex development. Nowadays, most of BI software vendors support Hadoop and Hive as data sources and this simplifies the task, but map-reduce is still a bottleneck. That is why Hadoop and HBase are not suitable for advanced ad-hocking. A lot of vendors are still working on this issue, so I hope that this situation will change in the near feature.
One of the differences between Column storages and Hadoop is a "reaction". A solution based on column-oriented storages provides you with low latency data access. Low-latency cannot be reached in Hadoop out-of-the box. Hadoop scans a file even if the answer is in the first row of the file, so “simple questions” will be hard-processed anyway and the key point here that the simple queries will be much slower in comparison with queries to column-oriented databases. Also neither Hadoop nor HBase operate well with small sets (because Hadoop is not a low-latency technology).
You can see that Hadoop ecosystem and other technologies marked in Brown do not replace each other; each technology has its own purpose and I am going to disclose that part in my next articles.
Today, there are many database technologies in the marketplace, each of them covering a specific need. The truth is that these trends emerged earlier, and now it is very important to consider this fact during solution design. Technologies are becoming so mature that they are massively used by a wide range of software venders and the community is growing fast enough. In my view, Hadoop is not yet mature enough, but I cannot say the same about, for example, document-oriented and column-oriented storages – they are widely used and this fact is difficult to argue. Under these circumstances, software developers need to clearly understand which technology should be applied and how, depending on the requirements.
About the Author
Taras Bachynskyy works as Data Architect at SoftServe, Inc. (www.softserveinc.com), a leading global provider of software development, testing and technology consulting services. Taras has more than 9 years of experience in design and implementation of enterprise-level data warehouses, reporting solutions, transactional systems and data integration solutions. He can be contacted at firstname.lastname@example.org.