Extraction Transformation Load (ETL) is the backbone for any data warehouse. In the data warehouse world data is managed by the ETL process, which consists of three processes, Extraction-Pull/Acquire data from sources, Transformation-change data in the required format and Load-push data to the destination generally into a data warehouse or a data mart.
SQL Server Integration Services (SSIS) is the tool in the ETL family that is useful for developing and managing an enterprise data warehouse. A data warehouse by its own characterization works on a huge volume of data and performance is a big challenge when managing a huge volume of data for any Architect or DBA.
2 ETL Improvement Considerations
Today, I will discuss how easily you can improve ETL performance or design a high performing ETL system with the help of SSIS. For a better understanding, I will divide ten methods into two different categories; first, SSIS package design time considerations and second configuring different property values of components available in the SSIS package.
2.1 SSIS Package Design Time Considerations
#1, Extract data in parallel; SSIS provides the way to pull data in parallel using Sequence containers in control flow. You can design a package in such a way that it can pull data from non-dependent tables or files in parallel, which will help to reduce overall ETL execution time.
#2, Extract required data; pull only the required set of data from any table or file. You need to avoid the tendency to pull everything available on the source for now that you will use in future; it eats up network bandwidth, consumes system resources (I/O and CPU), requires extra storage, and it degrades the overall performance of ETL system.
If your ETL system is really dynamic in nature and your requirements frequently change, it would be better to consider other design approaches, like Meta Data driven ETL, etc. rather than design to pull everything in at one time.
#3, Avoid the use of Asynchronous transformation components; SSIS is a rich tool with a set of transformation components to achieve complex tasks during ETL execution but at the same time it costs you a lot if these components are not being used properly.
Two categories of transformation components are available in SSIS; Synchronous and Asynchronous.
Synchronous transformations are those components which process each row and push down to the next component/destination, it uses allocated buffer memory and doesn’t require additional memory as it is direct relation between input/output data row which fits completely into allocated memory. Components like Lookup, Derived Columns, and Data Conversion etc. fall into this category.
Asynchronous transformations are those components which first store data into buffer memory then process operations like Sort and Aggregate. Additional buffer memory is required to complete the task and until the buffer memory is available it holds up the entire data in memory and blocks the transaction, also known as blocking transformation. To complete the task SSIS engine (data flow pipeline engine) will allocate extra buffer memory, which is again an overhead to the ETL system. Components like Sort, Aggregate, Merge, Join, etc. fall into this category.
Overall, you should avoid Asynchronous transformations but still, if you get into a situation where you don’t have any other choice then you must aware of how to deal with the available property values of these components. I’ll discuss them later in this article.
#4, Optimum use of event in event handlers; to track package execution progress or take any other appropriate action on a specific event, SSIS provides a set of events. Events are very useful but excess use of events will cost extra overhead on ETL execution.
Here, you need to validate all traits before enabling an event in the SSIS package.
#5, Need to be aware of the destination table schema when working on a huge volume of data. You need to think twice when you need to pull a huge volume of data from the source and push it into a data warehouse or data mart. You may see performance issues when trying to push huge data into the destination with a combination of insert, update and delete (DML) operations, as there could be a chance that the destination table will have clustered or non-clustered indexes, which may cause a lot of data shuffling in memory due to DML operations.
If ETL is having performance issues due to a huge amount of DML operations on a table that has an index, you need to make appropriate changes in the ETL design, like dropping existing clustered indexes in the pre-execution phase and re-create all indexes in the post-execute phase. You may find other better alternatves to resolve the issue based on your situation.
2.2 Configure Components Properties
#6, Control parallel execution of a task by configuring the MaxConcurrentExecutables and EngineThreads property. SSIS package and data flow tasks have a property to control parallel execution of a task; MaxConcurrentExecutables is the package level property and has a default value of -1, which means the maximum number of tasks that can be executed is equal to the total number of processors on the machine plus two;
EngineThreads is a data flow task level property and has a default value of 10, which specifies the total number of threads that can be created for executing the data flow task.
Data Flow Task
You can change default values of these properties as per ETL needs and resources availability.
#7, Configure Data access mode option in OLEDB Destination. In the SSIS data flow task we can find the OLEDB destination, which provides a couple of options to push data into the destination table, under the Data access mode; first, the “Table or view“ option, which inserts one row at a time; second, the “Table or view fast load” option, which internally uses the bulk insert statement to send data into the destination table, which always provides better performance compared to other options. Once you choose the “fast load” option it gives you more control to manage the destination table behavior during a data push operation, like Keep identity, Keep nulls, Table lock and Check constraints.
OLE DB Destination Editor
It’s highly recommended that you use the fast load option to push data into the destination table to improve ETL performance.
#8, Configure Rows per Batch and Maximum Insert Commit Size in OLEDB destination. These two settings are important to control the performance of tempdb and transaction log because with the given default values of these properties it will push data into the destination table under one batch and one transaction. It will require excessive use of tembdb and transaction log, which turns into an ETL performance issue because of excessive consumption of memory and disk storage.
OLE DB Destination Editor
To improve ETL performance you can put a positive integer value in both of the properties based on anticipated data volume, which will help to divide a whole bunch of data into multiple batches, and data in a batch can again commit into thedestination table depending on the specified value. It will avoid excessive use of tempdb and transaction log, which will help to improve the ETL performance.
#9, Use of SQL Server Destination in a data flow task. When you want to push data into a local SQL Server database, it is highly recommended to use SQL Server Destination, as it provides many benefits to overcome other option’s limitations, which helps you to improve ETL performance. For example, it uses the bulk insert feature that is built into SQL Server but it gives you the option to apply transformation before loading data into the destination table. Apart from that, it gives you the option to enable/disable the trigger to be fired when loading data, which also helps to reduce ETL overhead.
SQL Server Destination Data Flow Component
#10, Avoid implicit typecast. When data comes from a flat file, the flat file connection manager treats all columns as a string (DS_STR) data type, including numeric columns. As you know, SSIS uses buffer memory to store the whole set of data and applies the required transformation before pushing data into the destination table. Now, when all columns are string data types, it will require more space in the buffer, which will reduce ETL performance.
To improve ETL performance you should convert all the numeric columns into the appropriate data type and avoid implicit conversion, which will help the SSIS engine to accommodate more rows in a single buffer.
In this article we explored how easily ETL performance can be controlled at any point of time. These are 10 common ways to improve ETL performance. There may be more methods based on different scenarios through which performance can be improved.
Overall, with the help of categorization you can identify how to handle the situation. If you are in the design phase of a data warehouse then you may need to concentrate on both the categories but if you're supporting any legacy system then first closely work on the second category.