dcsimg
 

Merging Data Files with Statistica, Part 1

Friday Sep 21st 2018 by Deepak Vohra

Learn about the range of features and GUI provided by Statistica to make it one of the most commonly used data science tools.

The Statistica Platform is ranked in the top five data science platforms by Gartner's new report for 2017, "Magic Quadrant for Data Science Platforms" (https://www.gartner.com/doc/3606026/magic-quadrant-data-science-platforms), previously called "Advanced Analytics Platforms" in 2016. The broad range of features and a Graphical User Interface (GUI) provided by Statistica make it one of the most commonly used data science tools.

Statistica data files are called Spreadsheets, which have rows and columns of data. Rows of data are called cases and column headers for data are called variables. A common issue in data preparation is that different team members are developing or collecting data sets separately and the data sets have to be merged before the spreadsheet may be used. The data could be in multiple data files. We shall discuss how data in two different data files may be merged into a single data file with Statistica.

Statistica supports different types of merge modes for two data files, and these are:

  • Concatenate: When two data files are concatenated, one data file is taken and added (or concatenated) at the right-side of the other data file.
  • Cartesian: Creates a cross product of two data files.
  • Match Casenames: Merges the cases (rows) of one file with the cases of the other files by matching the case names.
  • Match Variables: Merges the rows of one data file with the rows of the other data file by matching the variable names.

We shall start by discussing the Concatenate merge. This tutorial has the following sections:

Setting the Environment

Download and install Statistica Platform. Statistica data files are called Spreadsheets (stored with the .sta suffix). We shall create some Statistica data files in this tutorial. A data file is created with File>New. In Create New Document, select Spreadsheet, as shown in Figure 1.

Selecting New Spreadsheet to create
Figure 1: Selecting New Spreadsheet to create

To save a data file, select File>Save As, as shown in Figure 2.

File>Save As
Figure 2: File>Save As

Concatenating Data Files

First, create the two data files that are to be merged. The data files to be merged would typically have the same number of rows and same or different number of columns. Because data is to be concatenated, the column names would typically be different. None of this is a requirement; two data files could have different number of rows and we shall discuss how to merge such a set of data files also. The objective is to merge the data in one data file with the other so that the 2nd data file is added at the right side of the 1st data file. As an example, create a data file (called wlslog1.sta) with column headers (variables) timestamp, category, and type and the following data (example log data).

4-8-2014-7:06:16,Notice,WebLogicServer
4-8-2014-7:06:17,Notice,WebLogicServer
4-8-2014-7:06:18,Notice,WebLogicServer
4-8-2014-7:06:20,Notice,WebLogicServer
4-8-2014-7:06:21,Notice,WebLogicServer
4-8-2014-7:06:22,Notice,WebLogicServer

The wlslog1.sta data file is shown in Statistica in Figure 3.

Data File wlslog1.sta
Figure 3: Data File wlslog1.sta

Create another data file (wlslog2.sta) with column headers servername, code, and msg, and add the following data (also example log data).

AdminServer,BEA-000365,STANDBY
AdminServer,BEA-000365,RESUMING
AdminServer,BEA-000365,ADMIN
AdminServer,BEA-000331,STARTING
AdminServer,BEA-000365,STARTED
AdminServer,BEA-000360,RUNNING

The wlslog2.sta file is shown in Figure 4. To merge the two data files, wlslog1.sta and wlslog2.sta, click the Data tab and select Merge, as shown in Figure 4.

Data File wlslog2.sta
Figure 4: Data File wlslog2.sta

A Merge Options dialog gets displayed, as shown in Figure 5. The Variables tab is selected by default. Select Mode as Concatenate. Click the File 1 button to select the 1st file to merge.

Merge Options
Figure 5: Merge Options

Select the wlslog1.sta file in the Select Spreadsheet dialog (see Figure 6). Click OK. The wlslog1.sta file gets added to the File 1 field. Similarly, select the 2nd file wlslog2.sta.

Selecting a Spreadsheet to Merge
Figure 6: Selecting a Spreadsheet to Merge

No other configuration is required. By default, an output spreadsheet gets generated and it may be configured with the Options tab, as shown in Figure 7. Keep the default settings for the output spreadsheet.

Options Tab
Figure 7: Options Tab

The two files to be merged get added to the File 1 and File 2 fields, as shown in Figure 8. The default setting for Unmatched Cases fills the data files with the missing values, which implies that empty data is stored for the section of a merged row (case) that does not match from one data file to another. Click OK.

Data Files to Merge
Figure 8: Data Files to Merge

The two data files get concatenated, as shown in Figure 9. The resulting spreadsheet has 6 columns and 6 rows.

Resulting Spreadsheet after a Merge
Figure 9: Resulting Spreadsheet after a Merge

If one spreadsheet were to have more rows than the other, the two spreadsheets would get concatenated just the same. As an example, add an extra row in the 1st spreadsheet (wlslog1.sta) to make 7 rows, as shown in Figure 10.

Extra Row in wlslog1.sta
Figure 10: Extra Row in wlslog1.sta

When concatenated with the 2nd spreadsheet (wlslog2.sta), the resultant spreadsheet has an extra row with missing data for the columns from the 2nd spreadsheet (see Figure 11).

Merged Spreadsheet
Figure 11: Merged Spreadsheet

Conclusion

In this tutorial, we introduced merging data files (also called spreadsheets) in Statistica Platform for data science. We discussed one of the merge modes: Concatenating merge. In a subsequent tutorial, we shall discuss merging by matching casenames and by matching variables.

Home
Mobile Site | Full Site