Having problems importing large amounts of data? ADO.NET has only one class designed to help with this task: SqlBulkCopy, exposed in ADO.NET 2.0.
I can't count the number of times I've fielded questions, in newsgroups and forums and from attendees at tech conferences, that imply developers believe ADO.NET (or any of the data access interfaces) is designed to handle "bulk" operations. The question often sounds like this:
"Ah, I've got 6 million rows of data that I need to get from X (an external data source of some kind), process at the client, and send up to my SQL Server production table. I can't seem to get it all in memory at once, and if I cut down the row count, the performance is still killing me. It can take all day to process..."
Although some of the newsgroup responders address the symptoms, suggesting the poster buy more RAM or faster hard drives, use a DataReader with live connections to the source and destination, or find another job, they miss the point: ADO.NET includes a class designed expressly for this task, SqlBulkCopy, introduced in ADO.NET 2.0.
The Existing Alternatives
The concept of bulk data transfers has existed since the earliest days of SQL Server as there has always been a need to import data from other data sources. The low-level TDS interface exposes a series of BCP calls and "events" (if you can call them that) that permit developers (and device-driver developers) access to this functionality on any version of SQL Server (except the Compact edition).
In the past, I've encouraged developers to leverage the (considerable) power and flexibility of the Bulk Copy (BCP) command-line utility or the easier-to-use Data Transformation Services (DTS) and SQL Server Integration Services (SSIS). These tools, which most DBAs know fairly intimately, have interactive front-ends that help DBAs and developers acting as DBAs (or moms with kids at their knees) point to virtually any data source and choose the data to import into specified SQL Server tables. The tools deal with a litany of issues, including helping you choose the right data source provider or driver and the correct target database and table(s). This means you can import data from text files, delimited files, raw data files, or intelligent data sources such as Oracle, DB2, or even JET/Access or Excel, as long as the source is exposed by an ODBC, OLE DB, or .NET Framework data provider. When you're done describing the bulk copy task, you can save the settings as an SSIS job that can be run on a scheduled basis or as you watch. SSIS is built into SQL Server Management Studio, so it's available to all of the SQL Server SKUs (except Express and Compact).
Using the SSIS tools means better performance, both developer performance and data throughput. That's because it's not only easy to use; you'll find that SSIS (and its predecessors clear back to TDS BulkCopy) uses a special technique to transfer the data to the server. Some of the performance difference is due to the fact that these bulk operations are minimally logged. Of course, this also means it does not make sense to import directly into a production table: if something goes wrong during the bulk-copy operation, you simply clear (truncate) the work table and restart the operation. In addition, based on settable options, you can choose whether to update indexes and whether to fire the triggers that validate the data. If you bypass those server-side constraints, the data is still "evil" when it arrives at the server. Remember my fourth law: "All data is evil until proven innocent." That means you'll need to run validation procedures on the data after it arrives and before it's posted to the production tables.
And yes, this functionality also can be implemented directly from SQL Server by using the TSQL BULK INSERT approach, but that's a story for another day.
ADO.NET 2.0's SqlBulkCopy Class
Yes, there are times when you need to write a custom application to move data into SQL Server, such as when you're working with SQL Server Express edition. That's where the SqlBulkCopy class comes in. I can't take credit for the developer team at Microsoft adding a bulk copy class to ADO.NET 2.0, but it's something that I and a number of others had been asking for since... well, for a long time. I wrote about this functionality, as exposed by SQL Server's native DBLibrary interface, in the first edition of "Hitchhiker's Guide to VBSQL," published in the early '90s. Basically, this class exposes the properties and events implemented by the low-level DBLibrary BulkCopy (BCP) functionality.
Now, let's walk through the SqlBulkCopy class and examine its properties and settings.
There are several SqlBulkCopy constructors that let you preset the properties (and options). In any case, you'll pass a SqlConnection object or a ConnectionString that addresses the target database. You also can pass a SqlBulkCopyOptions value built from the enumerations shown in Table 1. Note that as you enable these additional options, you might be adding work that must be done on the server, which can adversely affect performance. I've added a column to the table to indicate options that can hurt (or help) performance.
Table 1: The SqlBulkCopyOption enumerations
|Option|Description|Performance impact|
|CheckConstraints|Validate constraints as data is being imported.|Can hurt|
|Default|Use the default setting for all options (all are disabled by default).|Neutral|
|FireTriggers|As rows are added, fire any applicable triggers.|Can hurt|
|KeepIdentity|As rows are added, do not assign new identity values.|Neutral|
|KeepNulls|Preserve null values in the target table regardless of specified default values. Unless KeepNulls is specified, null values are replaced by default values defined for the column.|Neutral|
|TableLock|Obtain a bulk update lock for the table. If not specified, row locks are applied.|Can help|
|UseInternalTransaction|Run the bulk operation in a transaction.|Neutral|
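SqlBulkCopyOptions is a flags enumeration, so the values in Table 1 can be combined. Here's a minimal sketch of constructing a SqlBulkCopy with options; the connection string and table name are placeholder assumptions, not taken from the article's sample project:

```csharp
using System.Data.SqlClient;

class OptionsDemo
{
    static void Main()
    {
        // Placeholder connection string; point it at your own server and database.
        string connectString =
            "Data Source=.;Initial Catalog=Biblio;Integrated Security=SSPI;";

        // SqlBulkCopyOptions is a [Flags] enum, so values combine with |.
        SqlBulkCopyOptions options =
            SqlBulkCopyOptions.TableLock        // coarse lock; can help throughput
            | SqlBulkCopyOptions.KeepIdentity;  // don't generate new identity values

        using (SqlBulkCopy bcp = new SqlBulkCopy(connectString, options))
        {
            bcp.DestinationTableName = "dbo.WorkTable";
            // ... supply a DataTable or IDataReader to WriteToServer here ...
        }
    }
}
```

The (ConnectionString, SqlBulkCopyOptions) constructor shown here opens and closes its own connection; pass a SqlConnection instead when you want to reuse one you already hold.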
The SqlBulkCopy Class Properties
Table 2 lists the SqlBulkCopy properties that further modify how the BCP operation is to be handled, including how much feedback it returns while running. These properties include:
Table 2: The SqlBulkCopy Properties
|Property|Description|
|BatchSize|(Int) How many rows are sent per round trip to SQL Server. Tune for best performance; around 500 rows per batch is a reasonable starting point.|
|BulkCopyTimeout|(Int) How many seconds the application should wait for the operation to complete. Set it to about 1.5 times the normal operation length.|
|ColumnMappings|(SqlBulkCopyColumnMappingCollection) Maps input data columns to columns in the target table. Not required if the source and target column names are the same.|
|DestinationTableName|(String) Addresses the target table in the Connection-specified database. Syntax: database.schema.tablename.|
|NotifyAfter|(Int) Fires the SqlRowsCopied event after every N rows are processed. For best performance, do not set this value, or set it high enough that notifications arrive no more often than every 5 seconds or so.|
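Set against a SqlBulkCopy instance, those properties look like the following sketch. The table and column names are hypothetical, chosen only to illustrate the syntax:

```csharp
using System.Data.SqlClient;

class PropertiesDemo
{
    static void Configure(SqlBulkCopy bcp)
    {
        bcp.DestinationTableName = "Biblio.dbo.WorkTable"; // database.schema.table
        bcp.BatchSize = 500;       // rows per round trip; tune for best performance
        bcp.BulkCopyTimeout = 90;  // seconds; roughly 1.5x the normal run time
        bcp.NotifyAfter = 5000;    // raise SqlRowsCopied every 5000 rows

        // Only needed when source and target column names differ.
        bcp.ColumnMappings.Add("au_name", "Author");
    }
}
```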
Executing the WriteToServer Method
The data source for the SqlBulkCopy class can be an array of DataRow objects, an existing DataTable object, or any object that exposes the IDataReader interface. This includes any of the .NET data provider classes—even an Odbc or OleDb DataReader. Again, this means you can source data from virtually any data source with a driver. By leveraging ADO.NET's ability to construct data tables from XML, you also can use properly formatted XML documents as your data source. See the DataSet.ReadXml method for more information. If you don't have a driver for your data, you can write your own DataReader implementation. There are several examples of this on the web.
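As a sketch of the XML route just described, you can load a document with DataSet.ReadXml and hand the resulting DataTable to the bulk copy. The file path and table names here are hypothetical:

```csharp
using System.Data;
using System.Data.SqlClient;

class XmlSourceDemo
{
    static void Import(string connectString)
    {
        // Load a (hypothetical) XML document into a DataSet; ReadXml infers
        // the schema unless the document carries an inline XSD.
        DataSet ds = new DataSet();
        ds.ReadXml(@"C:\Data\Authors.xml");

        using (SqlBulkCopy bcp = new SqlBulkCopy(connectString))
        {
            bcp.DestinationTableName = "dbo.AuthorsWork";
            bcp.WriteToServer(ds.Tables[0]);
        }
    }
}
```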
The SqlBulkCopy class starts the BCP operation once the WriteToServer method is called. It accepts any object that supports IDataReader or a DataTable. You can also pass in a DataRowState option to specify that you only want "Modified" rows or rows that have some other specific RowState.
Once initiated by WriteToServer, the SqlBulkCopy class fires the SqlRowsCopied event if you specify a NotifyAfter value. In this event handler, you can abort the operation by setting the e.Abort property to True. The problem I found is that it's tough to pass a form-level value into this routine, because such values aren't updated until the WriteToServer thread completes. I expect it will take some fancy multi-threading footwork to implement.
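Wiring up the event looks like this sketch; the cancel flag is a placeholder for whatever signaling mechanism you settle on after working through the threading caveat above:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class ProgressDemo
{
    // Placeholder cancellation flag; how it gets set safely from another
    // thread is exactly the multi-threading problem discussed in the text.
    static bool cancelRequested;

    static void Run(SqlBulkCopy bcp, IDataReader reader)
    {
        bcp.NotifyAfter = 1000;  // fire the event every 1000 rows
        bcp.SqlRowsCopied += new SqlRowsCopiedEventHandler(OnRowsCopied);
        bcp.WriteToServer(reader);
    }

    static void OnRowsCopied(object sender, SqlRowsCopiedEventArgs e)
    {
        Console.WriteLine("{0} rows copied so far.", e.RowsCopied);
        if (cancelRequested)
            e.Abort = true;  // cancels the remainder of the bulk copy
    }
}
```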
While the operation is running, the Connection State is set to busy—unless you've enabled MARS, in which case other commands are interleaved between the BCP operations. Once you're done, be sure to close the SqlBulkCopy instance. This frees the Connection for subsequent use.
Tips and Hints
You have to have your ducks in a row before you initiate the BCP operation using the SqlBulkCopy class. Here are a few tips:
- Make sure you have access to the target table. This means you'll need to use an account that has both SELECT and INSERT rights. In many cases, I simply run a script to build the target table on the fly before I start loading it with data, but this (generally) requires DBO rights.
- No, it's not a good idea to target an existing production table for reasons I've already mentioned. Import your data into a "work" table and run your validation routines against it. Once the data is pure, fold the data into the production table(s) using stored procedures.
- To gain maximum efficiency, set your batch size fairly large (several hundred to several thousand rows), but not so large that it puts an additional burden on the server. I generally experiment with the BatchSize property to see which setting gives me the best performance and reliability.
- Don't set the NotifyAfter property too low—again, you don't want to interrupt the processing any more than necessary.
- If the SqlBulkCopy operation fails, it's probably a good idea to flush the connection pool—or simply disable pooled connections for this operation. I've seen reports of unusable connections once something goes wrong.
- Over the years, I've also heard of BCP operations crashing SQL Server. Although these reports have dwindled with more recent versions, you might be wise to test these bulk operations thoroughly before betting your company on them.
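The connection-pool tip above can be sketched as a defensive wrapper; SqlConnection.ClearPool is the documented way to evict a connection's pool, while the table name and structure here are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;

class RecoveryDemo
{
    static void SafeBulkCopy(SqlConnection connection, IDataReader reader)
    {
        using (SqlBulkCopy bcp = new SqlBulkCopy(connection))
        {
            bcp.DestinationTableName = "dbo.WorkTable";
            try
            {
                bcp.WriteToServer(reader);
            }
            catch
            {
                // A failed bulk copy can leave the pooled connection in an
                // unusable state, so evict this connection's pool before
                // letting the caller retry.
                SqlConnection.ClearPool(connection);
                throw;
            }
        }
    }
}
```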
Avoid any process that brings data to the client, processes it, and sends it back to the server—even with SqlBulkCopy. 99% of the time you can process the data far more efficiently on the server using TSQL or CLR stored procedures. Do you feel uncomfortable with these approaches? Perhaps they need to be added to your list of skills...
I have posted a full project to my Hitchhiker's Guide to Visual Studio and SQL Server (7th Edition) support site so registered readers can download the completed code. I'll provide the highlights here to get you started...
Part 1 of the example includes code to establish the SqlConnection to the target database; in this case, Biblio (the sample database provided with my book). I then execute a SQL command batch (using a custom class) to create the target database table from a TSQL script. Notice that the SqlBulkCopy object is declared "WithEvents" to expose the SqlRowsCopied event.
Part 2 illustrates code to open a connection using the ODBC "text" provider and expose a DataReader on this connection. Sure, you could have opened a custom file and parsed it into a DataTable or an array of DataRows if that's necessary—the WriteToServer method will accept either.
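A sketch of that Part 2 approach follows. The Microsoft Text Driver treats each file in the Dbq folder as a table; the folder and file names here are hypothetical stand-ins for the project's actual source file:

```csharp
using System.Data.Odbc;
using System.Data.SqlClient;

class TextSourceDemo
{
    static void Import(SqlBulkCopy bcp)
    {
        // Hypothetical folder; every .csv/.txt file in it acts as a table.
        string odbcConnect =
            @"Driver={Microsoft Text Driver (*.txt; *.csv)};" +
            @"Dbq=C:\Data\;Extensions=asc,csv,tab,txt;";

        using (OdbcConnection cn = new OdbcConnection(odbcConnect))
        {
            cn.Open();
            OdbcCommand cmd = new OdbcCommand("SELECT * FROM Authors.csv", cn);
            using (OdbcDataReader rdr = cmd.ExecuteReader())
            {
                // Stream rows straight from the text file to SQL Server.
                bcp.WriteToServer(rdr);
            }
        }
    }
}
```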
Part 3 shows how the code instantiates a new SqlBulkCopy instance and sets the properties. You're right. These properties are sub-optimal. I have them set this way to illustrate how the NotifyAfter event works. I'm only moving about 3500 rows and with larger BatchSize and NotifyAfter values, it would be over in a heartbeat.
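The downloadable project is the authoritative version of Part 3; a C# sketch of the same idea, using deliberately small values like those the text describes so the progress events are visible, might look like this (table name is hypothetical):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class Part3Demo
{
    static void Run(SqlConnection target, IDataReader source)
    {
        using (SqlBulkCopy bcp = new SqlBulkCopy(target))
        {
            bcp.DestinationTableName = "dbo.TextImport";
            bcp.BatchSize = 50;     // deliberately small; just for the demo
            bcp.NotifyAfter = 100;  // fire SqlRowsCopied every 100 rows
            bcp.SqlRowsCopied += new SqlRowsCopiedEventHandler(OnRowsCopied);
            bcp.WriteToServer(source);
        }
    }

    static void OnRowsCopied(object sender, SqlRowsCopiedEventArgs e)
    {
        Console.WriteLine("{0} rows copied.", e.RowsCopied);
    }
}
```

With only about 3,500 rows to move, production settings would finish before a single notification fired, which is why the demo dials them down.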
About the Author
William (Bill) Vaughn is an industry-recognized author, mentor, and subject-matter expert on Visual Studio, SQL Server, Reporting Services, and data access interfaces. He's worked in the computer industry for over thirty-five years—working with mainframe, minicomputer, and personal computer systems as a developer, manager, architect, trainer, marketer, support specialist, writer, and publisher. In 2000, after 14 years at Microsoft, Bill stepped away to work on his books, mentoring, and independent training seminars. He's written seven editions of the Hitchhiker's Guide to Visual Basic and SQL Server and three editions of ADO.NET and ADO Examples and Best Practices for Visual Basic (and C#) Programmers. He and Peter Blackburn also wrote the critically acclaimed Hitchhiker's Guide to SQL Server 2000 Reporting Services.
Bill is a top-rated speaker and frequents conferences all over the world, including TechEd, Visual Studio/SQL Connections, DevTeach, and many others. He's also written a wealth of articles for magazines such as MSDN, SQL Server Magazine, Visual Basic Programmer's Journal, .NET Magazine, and many others as well as a regular editorial for Processor magazine. Bill spends considerable time answering questions on the public newsgroups and speaking at INETA user group meetings all over the country and at other speaking venues all over the world. He's available for consulting, mentoring, or custom training. See www.betav.com or www.betav.com/blog/billva for his current schedule and course catalog.