RapidMiner is an open source data science software for enterprise. It provides a Graphical User Interface and offers a broad range of features including support for processing Big Data on Apache Hadoop with the Radoop client software. It is one of the most commonly used data science tools, according to the Rexer Analytics annual Data Miner Survey. RapidMiner offers all the data science features required by a data scientist, but we shall discuss only one of the features: merging database tables.
A common challenge in data preparation is that team members collect data sets separately and data sets have to be merged before data may be analyzed. Before data is merged, it may reside in multiple database tables. In this tutorial, we will discuss how data in two different database tables may be merged into a single database table using RapidMiner. We will use MySQL database tables as an example in this tutorial. The underlying concepts discussed in this tutorial apply to any type of database supported by Rapid Miner. This article has the following sections:
- Setting the Environment
- RapidMiner Process Operators for Merging Database Tables
- Creating the MySQL Database Tables to Merge
- Creating a RapidMiner Process
- Creating a MySQL Database Connection
- Merging Database Tables with the Union Operator
Download and install RapidMiner Studio 7.3.x (or a later version) from https://my.RapidMiner.com/nexus/account/index.html#downloads. We shall use MySQL database for example tables to merge. Download and install MySQL database from https://dev.mysql.com/downloads/.
RapidMiner is based on using RapidMiner operators in a RapidMiner "process." Some of the operators are specialized for specific types of input and output. For a database, the following (Table 1) operators are provided to read, write, and update a database table.
|Read Database||Reads an ExampleSet from SQL relational database. Stores data in memory.|
|Write Database||Writes an ExampleSet to a SQL database.|
|Stream Database||Reads an ExampleSet from SQL database but does not read the data into memory. Instead, keeps the data in the database, which allows data of varying sizes implying large database tables to be read.|
|Update Database||Updates an existing SQL database table.|
Table 1: RapidMiner Process Operators for Databases
To merge or combine data, which could be from another data source and not just an ExampleSet from a SQL database, RapidMiner provides several operators, as discussed in Table 2.
|Join||Joins two ExampleSets. Supports inner join, left outer join and right outer join, and full outer join.|
|Append||Appends or merges two ExampleSets into a single ExampleSet. The ExampleSets must be compatible, which implies that they should have the same number of attributes and attribute names. If one ExampleSet has a different number of attributes or different attribute names than the other, the Append operator does not merge the ExampleSets.|
|Union||Merges two ExampleSets to create a resultant ExampleSet. The ExampleSets do not need to have the same number attributes or attribute names. The resultant ExampleSet has attributes from both ExampleSets.|
|Merge||Merges attribute values from a single ExampleSet. Merge is not the operator we need because we shall be merging two different ExampleSets obtained from two different database tables.|
Table 2: RapidMiner Process Operators for Databases
Next, create two MySQL database tables to merge. First, log in to the MySQL client interface (CLI), as shown in Figure 1.
Figure 1: Logging into MySQL CLI
List the databases in the MySQL database with the show databases command, as shown in Figure 2.
Figure 2: Listing Database with show databases
Set the database to use as "mysql," as shown in Figure 3.
The example database tables are for log data. Some other example database tables may also be used, but the procedure to merge shall be the same. Create database table WLSLOG1.
CREATE TABLE WLSLOG1(LOGID VARCHAR(255) PRIMARY KEY, CODE VARCHAR(255),MSG VARCHAR(255));
Add log data to WLSLOG1.
INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log1", "AdminServer", "BEA-000365","Server state changed to STANDBY" ); INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log2", "AdminServer", "BEA-000365","Server state changed to STARTING"); INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log3", "AdminServer", "BEA-000365","Server state changed to ADMIN");
Table WLSLOG1 gets created, as shown in Figure 3.
Figure 3: Creating table WLSLOG1
Similarly, create database table WLSLOG2.
CREATE TABLE WLSLOG2(LOGID VARCHAR(255) PRIMARY KEY, CODE VARCHAR(255), MSG VARCHAR(255));
Add example data to WLSLOG2.
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log4","AdminServer","BEA-000331", "Server state changed to RESUMING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log5","AdminServer","BEA-000365", "Started WebLogic AdminServer"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log6","AdminServer","BEA-000365", "Server state changed to RUNNING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log7","AdminServer","BEA-000360", "Server started in RUNNING mode");
MySQL database table WLSLOG2 gets created, as shown in Figure 4.
Figure 4: Creating table WLSLOG2
Create a third table, WLSLOG, for the merged output of the other two database tables.
CREATE TABLE WLSLOG(LOGID VARCHAR(255) PRIMARY KEY, CODE VARCHAR(255),MSG VARCHAR(255));
The table for the merged data gets created, as shown in Figure 5.
Figure 5: Table for merged data
A table for the merged table is not required because a provision for creating a new database table for the merged table data is also provided. We shall discuss both options, creating a new table and using an existing table.
Before we are able to use the RapidMiner operators, we need to create a RapidMiner process. Select File>New Process, as shown in Figure 6, in RapidMiner.
Figure 6: File>New Process
In Choose a template to start from, select Blank, as shown in Figure 7.
Figure 7: New Process>Blank
A new Process gets created, as shown in Figure 8.
Figure 8: New Process
Initially, a Blank Process is blank, as shown in Figure 9.
Figure 9: New Process is Blank
RapidMiner provides database drivers for several SQL databases. To find if a specific database provider is supported, select Connections>Manage Database Drivers, as shown in Figure 10.
Figure 10: Connections>Manage Database Drivers
The supported database drivers are listed in Manage Database Drivers, as shown in Figure 11. MySQL is already supported and so are the other commonly used SQL databases, such as PostgreSQL, Oracle, and SQL Server. But, if a database is not listed, click Add to add a configuration for the database driver. Select the MySQL database driver to display the settings for MySQL database driver. Click Save to exit the Manage Database Drivers.
Figure 11: Manage Database Drivers
To create a database connection, click Connections>Manage Database Connections, as shown in Figure 12.
Figure 12: Connections>Manage Database Connections
In Connection Details, specify a Name (MySQLDB), and select Database system as MySQL, as shown in Figure 13. Specify Host as "localhost" and Port as 3306. Specify Database scheme as "mysql". Specify User as "root" and specify the Password for the root user as configured while installing MySQL database. Click Test to test the connection configuration. If the connection gets established, a "Connection ok" message gets displayed.
Figure 13: Test
Click OK to complete the connection configuration, as shown in Figure 14.
Figure 14: Configuring database connection for MySQL
Next, we shall demonstrate merging the two example MySQL database tables with the Union operator. First, we need to create two ExampleSets from the two database tables, WLSLOG1 and WLSLOG2. The Read Database operator is used to create an ExampleSet from a database table. Right-click in the Process region and select Insert Operator>Data Access>Database>Read Database, as shown in Figure 15.
Figure 15: Insert Operator>Data Access>Database>Read Database
A diagram for the Read Database operator gets added to the process, as shown in Figure 16. To add a 2nd Read Database operator for the 2nd database table, right-click in the Process region and select Insert Operator>Data Access>Database>Read Database again.
Figure 16: Adding a 2nd Read Database
A 2nd Read Database diagram gets added. Next, we shall configure the Read Database operators. Click Read Database for the first database table. In the tab Parameters for Read Database, select predefined for define connection. In the connection field, select MySQLDB. For define query, select table name, as shown in Figure 17.
Figure 17: Setting Read Database parameters
Select table name as wlslog1, as shown in Figure 18. Keep the default setting for datamanagement.
Figure 18: Setting table name
Similarly, for the Read Database (2) operator, select in the Parameters tab predefined for define connection. In the connection field, select MySQLDB. For define query, select table name. In table name, select the wlslog2 table, as shown in Figure 19.
Figure 19: Setting Read Database (2) parameters
Next, add the Union operator. Right-click in Process and select Insert Operator>Blending>Table>Joins>Union, as shown in Figure 20.
Figure 20: Process>Insert Operator>Blending>Table>Joins>Union
A diagram widget for the Union operator also gets added to the Process, as shown in Figure 21.
Figure 21: Union operator
Next, we need to connect the operators. To connect one operator with another, click a handle provided for the operator and drag to a handle on the operator to connect to, as shown in Figure 22.
Figure 22: Connecting Process Operators
The connection link is color coded to indicate the status of a connection. A connection that is complete is indicated with the light blue color, as shown for the Read Database to Union connection in Figure 23.
Figure 23: Connection links are color coded to indicate status: light blue for completed connection
The two Read Database operators connected to the Union operator are shown in the Process in Figure 24.
Figure 24: Two Read Database operators
Next, add the Write Database operator to output a merged (or, more accurately, a union) database table. Right-click in Process and select Insert Operator>Data Access>Database>Write Database, as shown in Figure 25.
Figure 25: Process>Insert Operator>Data Access>Database>Write Database
A widget for Write Database gets added, as shown in Figure 26.
Figure 26: Write Database
Configure the parameters for Write Database, as shown in Figure 27. Select the define connection field as predefined, connection as MySQLDB, and table name as wlslog.
Figure 27: Parameters for Write Database
Connect the output of the Union operator with the input of Write Database, as shown in Figure 28.
Figure 28: Process Operators configured and connected
To arrange the operators added, select Process>Process Layout>Arrange Operators, as shown in Figure 29.
Figure 29: Arranging Operators
The operators get arranged, as shown in Figure 30. Arranging operators does not have any effect other than to provide a better graphical layout.
Figure 30: Arranged Operators
The table for output of Write Database is created if not already created. By default, the Write Database operator creates a new table. Because we already created a database table, we need to set a parameter to overwrite the existing database table wlslog. Select Show advanced parameters, as shown in Figure 31.
Figure 31: Show advanced parameters
Select overwrite mode as overwrite first, append then as shown in Figure 32, which would first overwrite an existing table's column values and subsequently append. The other options are overwrite and append.
Figure 32: The overwrite mode options
As we shall demonstrate later, the overwrite mode is required to be set for an existing table, as shown in Figure 33.
Figure 33: Setting overwrite mode is required for an existing table
After all the process operators have been configured and the connections between them set, select Process and enable Synchronize Meta Data with Real Data and Validate Automatically. To run the process, click Process>Run Process Locally.
Figure 34: Process>Run Process Locally
The process gets started. First, the Read Database operator is run on the first database table, as indicated by the in-progress graphic in Figure 35. Subsequently, an ExampleSet is created from the 2nd Read Database operator. The Union operator creates a union of the two ExampleSets. And, the merged output data is added to the wlslog table with the Write Database operator.
Figure 35: Merge process started
The process runs to completion, as shown in Figure 36.
Figure 36: Process runs to completion
Run a SELECT query for the WLSLOG table in the MySQL CLI. The merged database table data from WLSLOG1 and WLSLOG2 gets listed for the WLSLOG table, as shown in Figure 37.
Figure 37: Merged database table WLSLOG
To demonstrate that the overwrite mode must be set to other than the default none, keep the default setting of none and run the process. A database error gets generated, indicating that the wlsog table already exists, as shown in Figure 38.
Figure 38: Database error
In this article, we discussed merging database tables with RapidMiner. We used MySQL database tables as an example, but any of the supported SQL databases may be used. In a subsequent article, "Using RapidMiner with MySQL Tables," we shall discuss another process operator, APPEND, to append database tables. We shall also discuss the difference between Union and APPEND. We shall also discuss the UPDATE operator for updating a database table.