Looking back a few years, we see that two hotspots quickly formed and grew wildly in the IT industry: the Big Data (NOSQL) movement and mobile programming on smartphones. With Big Data, developers easily get lost in the jungle of new terms, mesmerized by the vast technology landscape. Rookie questions like "What is the difference between Cassandra and Hadoop?" and "How does Storm compare to Hadoop?" are quite common across the internet. Targeting JavaEE developers, this three-article series aims to jump-start developers' study and put them on the right learning track for Big Data. The author hopes that this work gives developers confidence wherever they land in the Big Data battleground.
Java Rules, but Challenges Remain
JavaEE, as an enterprise computing platform, offers APIs and runtime environments that make server-side programming multi-tiered, modularized, reliable, portable, and secure. In addition, innovative techniques from the open source Java community constantly proliferate as complements to core Java and JavaEE. That being said, JavaEE developers still face performance and scalability challenges from time to time in their jobs. Once a web application is published online, the volume of users and the size of data are expected to grow continuously. Moreover, the application itself may undergo enhancements, expansions, and integration with heterogeneous systems due to EAI (enterprise application integration) requirements. Have you ever felt helpless when troubleshooting a performance issue? Have you ever been required to take application performance, throughput, and availability to the extreme?
High-Availability (HA) Challenge
Developers almost inevitably experience some downtime in their production systems, either by accident or on purpose. It would not surprise you to visit an e-commerce website on a weekend after midnight and see a maintenance notification page. Conceivably, someone brought down the servers to upgrade the application, the OS, or other software. On the other hand, we hardly see any downtime from big internet companies like Google, Amazon, Facebook, and Twitter, even though these web sites are constantly being improved and upgraded. The capability to run an interactive, transactional website 24/7 is the high-availability (HA) challenge.
A software system is considered scalable when it is resilient in accommodating a growing workload without major redesign or code refactoring. In many circumstances, scalability is a precondition for availability. Imagine hundreds of millions of registered users on your e-commerce website, millions of concurrent users at any moment, billions of items, and tens of thousands of reads and thousands of writes per second (eBay and Amazon fall in this range). Is your system ready to take that level of stress with reasonable performance? Likewise, performing batch or real-time analytics on a big data set with hundreds of millions of records is equally difficult. Big internet companies all face this kind of scalability challenge. How do they keep up with the demands?
A General Approach Toward HA
The author hereby uses JBoss AS clusters as an example to illustrate a common approach to the HA challenge for web servers. It is all about session distribution, replication, and failover, which may not sound like anything new to you. A load balancer (or multiple load balancers) distributes web sessions to the nodes in a cluster (or clusters). Sticky sessions ensure that once a request from a user is dispatched to and accepted by a server node, all subsequent requests from the same user are directed by the load balancer(s) to the same node unless failover occurs. If the designated node fails, future requests from that user are dispatched to another node holding a replicated copy of the user's session. This addresses accidental node loss.
To provide HA, multiple JBoss AS clusters are needed in a LAN environment (or in multiple data centers connected by WAN for geographical failover). When you need to upgrade web applications, the application server, the OS, or other software, you stop one cluster from serving new requests through the load balancer(s), let the current sessions drain out, and then bring that cluster offline to perform any upgrade or maintenance needed; once ready, bring it back online. Then repeat the same steps for the remaining clusters. This approach, named "rolling release/rolling upgrade," causes no downtime and no interference with the user experience.
Relational database upgrades, in the meantime, are more involved, since they include schema changes and data migrations. Tools like DBDeploy assist DBAs in managing the process. To avoid application downtime, a transitional phase is required in which the old and new schemas coexist in the same database. The application has to be adjusted so that it reads from both the new and old schemas while writing only to the new schema during that phase. Migrating the existing data is even tougher, as it always involves database locks, which impair database performance.
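The transitional read path described above can be sketched in a few lines. Below is a toy, in-memory illustration (all class and method names are hypothetical, with two maps standing in for the old and new database schemas): reads prefer the new schema and fall back to the old one, while writes go to the new schema only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the transitional-phase data access: during a live schema migration
// the application writes only to the new schema but falls back to the old
// schema on reads until the data migration completes. The two maps stand in
// for the two coexisting schemas (all names here are hypothetical).
class TransitionalDao {
    private final Map<String, String> oldSchema = new HashMap<>();
    private final Map<String, String> newSchema = new HashMap<>();

    // Simulates pre-existing legacy data that has not been migrated yet.
    void writeOldSchemaDirectly(String id, String row) {
        oldSchema.put(id, row);
    }

    // During the transitional phase, all writes target the new schema only.
    void save(String id, String row) {
        newSchema.put(id, row);
    }

    // Reads prefer migrated data; otherwise fall back to the old schema.
    String find(String id) {
        String row = newSchema.get(id);
        return row != null ? row : oldSchema.get(id);
    }
}
```

Once the background migration has copied every remaining row into the new schema, the fallback branch becomes dead code and the old schema can be dropped.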
Apart from regular upgrades and releases, preventing accidental downtime due to application faults comes down to testing, which must be conducted sufficiently and thoroughly. The industry has learned far too many painful lessons from inadequate testing.
Achieving Scalability in Different Contexts
A server system can be scaled in two directions: vertically or horizontally. Scaling vertically (scaling up) means adding more memory or disk space, upgrading the CPU, or replacing the machine with a more powerful one. Scaling horizontally (scaling out) means adding more servers (nodes) to a cluster (or clusters). Vertical scaling has short-term benefits in that it is almost effortless from both the developer's and the administrator's perspective; at most, some minor setting adjustments are needed to accommodate the hardware changes, for instance the heap size of the JVM. Evidently, vertical scaling has a physical cap: as the business grows, one day we won't be able to keep up by indefinitely increasing the capacity of a single server. Horizontal scaling doesn't have that limitation; besides, it is cost-effective in the long run and brings the HA benefit along. However, programming in a horizontally scaled system can be quite involved, as with database sharding and clustering, and managing a clustered system is more difficult for server administrators. That said, application server vendors have made clustering easy, if not transparent, for developers.
Scaling out Web Servers
A web session is by nature self-contained and isolated from others. The same record/row fetched from the database by different users is presented as multiple copies under different user sessions. There is no interference between the copies of the same row, regardless of the persistence technique chosen, such as JPA or JDBC. The in-memory copies of the same record/row can easily become out of sync due to user interactions. The nature of web servers (not limited to JavaEE) allows this to happen and defers concurrency management to the database. Acting as a single touch point, the database manages data concurrency through transactions and locking (either optimistic or pessimistic). You may argue that a shared cache in JPA makes records shared across users; however, a shared cache is close in nature to an in-memory data grid, which may have transaction and locking features built in. These facts make the web/application server easy to scale out; JBoss announced that AS7 can be scaled to 1000+ nodes. Nevertheless, this is not the happy ending of the story. Imagine 1000 application servers all funneling through a single database server; such an architecture won't take off, in that the database is not only the bottleneck for performance and throughput, but also a single point of failure (SPOF).
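The optimistic flavor of locking mentioned above is worth a quick sketch. The toy class below (names are hypothetical; it is not tied to any particular database or JPA provider) mimics what a database does with a version column: an update succeeds only if the row's version still matches what the caller originally read.

```java
// Toy illustration of optimistic locking. A real database enforces this with
// "UPDATE ... WHERE id = ? AND version = ?", and JPA's @Version annotation
// drives the same check automatically; here a single in-memory "row" shows
// the mechanism. (All names in this sketch are hypothetical.)
class VersionedRow {
    private String value;
    private int version = 0;

    VersionedRow(String value) { this.value = value; }

    synchronized int readVersion() { return version; }
    synchronized String readValue() { return value; }

    // Returns true if the write won; false means another session updated the
    // row first, and the caller must re-read and retry (or report a conflict).
    synchronized boolean update(String newValue, int expectedVersion) {
        if (version != expectedVersion) return false; // stale copy detected
        value = newValue;
        version++;
        return true;
    }
}
```

Two user sessions may each hold a copy of the same row; whichever writes second sees a version mismatch and must reconcile, which is exactly how the database resolves the out-of-sync copies described above.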
Scaling out Databases
Scaling out a database, by contrast, is an intimidating amount of work. It would be nice to have a data store that is elastic and smart enough to do automated (even transparent) sharding, re-sharding, and failover to meet your scalability and HA goals. Big Data, NOSQL (Not Only SQL), and NewSQL technologies were invented to bring you these advantages at a manageable cost.
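To appreciate why automated sharding is so valuable, consider what the manual alternative looks like. The sketch below (a hypothetical example, not any particular store's API) routes records to shards by hashing a shard key. Note that changing the shard count remaps almost every key, which is exactly the re-sharding pain these data stores aim to automate.

```java
// Minimal hash-based shard router, the naive manual-sharding approach: each
// record's shard index is derived from a shard key such as a user id. If the
// shard count changes, hash(key) % N changes for most keys, forcing a massive
// data reshuffle -- the motivation for automated/transparent re-sharding.
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) { this.shardCount = shardCount; }

    // Deterministically map a shard key to a shard index in [0, shardCount).
    // Math.floorMod keeps the result non-negative even for negative hashCodes.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shardCount);
    }
}
```

Every data access layer in the application must then consult the router before touching a connection pool, which is why hand-rolled sharding leaks into so much application code.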
Big Data Movement
It started when Google released its papers on GFS (Google File System, 2003) and MapReduce (2004), which Google had been using for web search indexing and analytics. The concept immediately inspired Doug Cutting, who was in charge of the Apache Lucene and Nutch projects and faced severe scalability challenges. By quickly implementing a new project around the same ideas, Doug's team was able to boost Nutch's scalability phenomenally. That project was later named Hadoop, with HDFS (Hadoop Distributed File System) underneath. Hadoop is an open source framework for writing and running distributed applications that process large amounts of data in parallel.
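The MapReduce idea itself is simple enough to sketch in plain Java. The toy, single-JVM word count below (an illustration only; real Hadoop distributes these phases across many nodes and reads its input from HDFS) walks through the three phases: map emits key/value pairs, shuffle groups them by key, and reduce aggregates each group.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-JVM illustration of the MapReduce flow for word counting.
class ToyWordCount {

    // Map phase: emit (word, 1) for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }
}
```

What makes the model scale is that map and reduce are both embarrassingly parallel: the framework can run the map step on thousands of input splits and the reduce step on thousands of key groups, each on a different node.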
Data in HDFS is read-only from the consumer's perspective. That essentially makes Hadoop a data-warehouse-like tool serving big data analysis, profiling, and other Business Intelligence (BI) objectives, among others. Evidently, a read-only storage system cannot serve as an OLTP (OnLine Transaction Processing) database, the way a relational database does in an e-commerce application. An OLTP database must at minimum support CRUD (create, read, update, delete) operations. That is where NOSQL and NewSQL databases come into play.
Google BigTable is a compressed, high-performance, column-family-oriented NOSQL database built on GFS (collocated with MapReduce). As part of Google's web search solution, BigTable is designed to scale into the petabyte range across hundreds or thousands of machines. Apache HBase, built on HDFS, is modeled after Google's BigTable and is often described as the Hadoop database. Amazon Dynamo, in turn, is a highly available distributed store originally built to serve Amazon's shopping cart. Apache Cassandra, which originated at Facebook, combines Dynamo's distribution design with a BigTable-style column-family data model. The desired features of automated (or transparent) sharding, re-sharding, and failover are available in these NOSQL stores, but at a price imposed by the constraints of the CAP theorem.
Classification of Technologies
The author tentatively groups big data technologies into two categories: big data processing, which analyzes data in a read-only fashion, and big data persistence (NOSQL and NewSQL), which serves as OLTP databases.
There are three subcategories under big data processing:
- Batch processing, with Google MapReduce, Apache Hadoop, and the Hadoop ecosystem. Data processing in this category is full-power, full-"table" scan, with a significant lag (from minutes to hours).
- Stream processing, with Twitter Storm and Apache S4. A stream of data is processed in real time in parallel. The slice of data being processed at any moment by an aggregate function is demarcated by a sliding window.
- Fast/real-time processing, with Google Dremel, Cloudera Impala, Apache Drill, and the Lambda Architecture. These perform ad hoc analysis on a full big data set within a lag of seconds, working effectively as big data OLAP (OnLine Analytical Processing) systems. The amount of data processed per operation in OLAP is substantially more massive than in OLTP. Imagine how daunting the challenge is!
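The sliding window mentioned under stream processing can be sketched as follows (a hypothetical, single-threaded illustration; Storm and S4 run many such aggregations in parallel across a cluster): only events whose timestamps fall inside the most recent window contribute to the aggregate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sliding-window aggregate: maintains the sum of event values observed
// within the last windowMillis milliseconds. Events older than the window
// are evicted as new events arrive. (All names here are hypothetical.)
class SlidingWindowSum {
    private final long windowMillis;
    private final Deque<long[]> events = new ArrayDeque<>(); // {timestamp, value}
    private long sum = 0;

    SlidingWindowSum(long windowMillis) { this.windowMillis = windowMillis; }

    // Record an event, then evict everything older than (timestamp - window).
    // Assumes events arrive in timestamp order, as in an in-order stream.
    void add(long timestamp, long value) {
        events.addLast(new long[] { timestamp, value });
        sum += value;
        while (!events.isEmpty() && events.peekFirst()[0] <= timestamp - windowMillis) {
            sum -= events.removeFirst()[1];
        }
    }

    long currentSum() { return sum; }
}
```

A stream processor keeps one such aggregate per key (per user, per URL, and so on) and emits the current value continuously, which is how it answers "what happened in the last N seconds?" with near-zero lag.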
The second article of this series introduces you to big data processing technologies.
If you are looking for a highly scalable and performant database to support your interactive and transactional website, the third article on big data persistence brings you a variety of NOSQL and NewSQL options. It is worth noting that NOSQL is not a synonym for big data, albeit with some overlap. In addition, the CAP theorem, along with new data access patterns and practices, is discussed in a good level of detail.
Endeavors toward Uniform Abstraction APIs
The blast of Big Data technologies is both a blessing and a curse to JavaEE developers. Unlike the POJO programming and data model of JavaEE and the Spring framework, APIs in Big Data are specific, quickly evolving, and short of industry standards. The good news is that there are efforts to abstract these proprietary techniques under uniform APIs. I hereby list a few for your checklist.
- Kundera: Provides a JPA 1.0 compatible API for accessing HBase, Cassandra, and MongoDB databases.
- Apache Gora: The overall goal for Gora is to become the standard data representation and persistence framework for big data via an easy-to-use, Java-friendly common API. It supports column-family stores (HBase, Cassandra), key-value stores (Redis), document stores (MongoDB), relational databases (MySQL, HyperSQL), and flat files on HDFS. In addition, it provides an indexing API for Lucene and Solr, and an analysis API for Apache Pig, Hive, and Cascading.
- Spring Data/Spring XD: The Spring team took a different philosophy. Instead of presenting an abstraction API, they offer a consistent programming model across different NOSQL stores (HBase, MongoDB, Redis, Neo4j, GemFire) using patterns and abstractions already known from within the Spring framework. Moreover, Spring Data offers data-binding infrastructure to map NOSQL native data types to a POJO data model. Spring XD (eXtreme Data) was introduced more recently to address common big data use cases. More details can be found on the official web site.
This article unveils the HA and scalability challenges faced by enterprise application developers. The Big Data/NOSQL movement originated to overcome these challenges. Before you get excited about a specific Big Data technology and roll up your sleeves to start coding, it is better to get the big picture of Big Data first. The author hopes this work jump-starts your study of Big Data and assists you in making the right design decisions.
About the Author
Xinyu Liu is a Sun Microsystems certified enterprise architect with intensive software design and development experience in cutting-edge server-side technologies. He took his graduate degree from George Washington University and currently serves as the application architect for the Virginia Workers' Compensation Commission. Dr. Liu has written for Java.net, JavaWorld.com, and IBM developerWorks on topics such as JSF, Spring Security, Hibernate Search, Spring Web Flow, the Servlet 3.0 specification, and the Drools rule engine. He also worked with Packt Publishing reviewing the books Spring Web Flow 2 Web Development, Grails 1.1 Web Application Development, and Application Development for IBM WebSphere Process Server 7 and Enterprise Service Bus 7.
Resources
- Clustering for High Availability (HA) with JBoss AS7: Online presentation from JBoss.
- Hibernate Shards: Simplifying Hibernate programming with sharded databases.
- Google MapReduce: Simplified data processing on large clusters.
- Google BigTable: A distributed storage system for managing structured data.
- Apache Hadoop: A reliable, scalable, distributed computing framework.
- Apache HBase: A distributed, scalable, big data store.
- Apache Cassandra: A distributed, scalable, big data store.
- MongoDB: A JSON document database.
- Neo4j: A graph database.
- Twitter Storm: A free and open source distributed realtime computation system.
- Apache S4: A general-purpose, distributed, and scalable data stream processing system.
- Google Dremel: Interactive analysis of web-scale datasets.
- Cloudera Impala: Real-time queries in Apache Hadoop.
- Apache Drill: Provide low latency ad-hoc queries to many different data sources.
- Big Data: Introduce the Lambda Architecture.
- Kundera: JPA 1.0 ORM library for the Cassandra/HBase/MongoDB databases.
- Apache Gora: An in-memory data model and persistence for big data.
- Spring Data: Modern data access for enterprise Java.
- Spring XD: A unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Spring Social: Connect your applications with SaaS providers such as Facebook and Twitter.
- Hadoop in Action: A book on Hadoop.
- Hadoop in Practice: A book on Hadoop ecosystem.
- HBase in Action: A book on HBase.
- MongoDB in Action: A book on MongoDB.
- NoSQL Distilled: A brief guide to the emerging world of polyglot persistence.