How can I insert 10 million records in the shortest time possible?
I have a file (which has 10 million records) like below:
line1
line2
line3
line4
.......
......
10 million lines
So basically, I want to insert 10 million records into the database. To do that, I read the file and upload it to SQL Server.
C# code
string line;
System.IO.StreamReader file =
    new System.IO.StreamReader(@"c:\test.txt");
while ((line = file.ReadLine()) != null)
{
    // insertion code goes here
    //DAL.ExecuteSql("insert into table1 values(" + line + ")");
}
file.Close();
But inserting row by row like this will take a long time. How can I insert 10 million records in the shortest time possible using C#?
Update 1:
BULK INSERT:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    ROWTERMINATOR = ' \n'
);
My Table is like below:
DATAs
(
DatasField VARCHAR(MAX)
)
but I am getting the following errors:
Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
The code below worked:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
);
Solution 1:
Please do not create a DataTable to load via BulkCopy. That is an OK solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.
Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.
A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.
Required database objects:
-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO

-- Second: Use the UDTT as an input param to an import proc.
-- Hence "Table-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
    @ImportTable dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT Field
    FROM   @ImportTable;
GO
C# app code to make use of the above SQL objects is below. Notice how, rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
    SqlMetaData[] _TvpSchema = new SqlMetaData[] {
        new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
    };
    SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
    StreamReader _FileReader = null;

    try
    {
        _FileReader = new StreamReader("{filePath}");

        // read a row, send a row
        while (!_FileReader.EndOfStream)
        {
            // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
            // SQL Server already received the row when "yield return" was called.
            // Unlike BCP and BULK INSERT, you have the option here to create a string,
            // call ReadLine() into the string, do manipulation(s) / validation(s) on
            // the string, then pass that string into SetString(), or discard if invalid.
            _DataRecord.SetString(0, _FileReader.ReadLine());
            yield return _DataRecord;
        }
    }
    finally
    {
        _FileReader.Close();
    }
}
The GetFileContents method above is used as the input parameter value for the Stored Procedure, as shown below:
public static void test()
{
    SqlConnection _Connection = new SqlConnection("{connection string}");
    SqlCommand _Command = new SqlCommand("ImportData", _Connection);
    _Command.CommandType = CommandType.StoredProcedure;

    SqlParameter _TVParam = new SqlParameter();
    _TVParam.ParameterName = "@ImportTable";
    _TVParam.TypeName = "dbo.ImportStructure";
    _TVParam.SqlDbType = SqlDbType.Structured;
    _TVParam.Value = GetFileContents(); // return value of the method is streamed data
    _Command.Parameters.Add(_TVParam);

    try
    {
        _Connection.Open();
        _Command.ExecuteNonQuery();
    }
    finally
    {
        _Connection.Close();
    }

    return;
}
Additional notes:
- With some modification, the above C# code can be adapted to batch the data in (a sketch follows this list).
- With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
- You can also manipulate the value of each record in the SELECT statement in the proc.
- You can also filter out rows by using a WHERE condition in the proc.
- You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
- Advantages over SqlBulkCopy:
  - SqlBulkCopy is INSERT-only, whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
  - Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
  - You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there were an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both, since multiple result sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy, yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
  - For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
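Regarding the first note above (batching), here is a rough sketch of one way to adapt the code; this is my illustration, not code from the linked article. The iterator yields at most batchSize rows per call from a shared StreamReader, and the proc is executed once per batch. Note that ImportData as written above truncates dbo.DATAs, so for batching the TRUNCATE would have to move out of the proc. Names like GetBatchContents and the 100,000 batch size are assumptions:
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetBatchContents(StreamReader fileReader, int batchSize)
{
    SqlMetaData[] tvpSchema = new SqlMetaData[] {
        new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
    };
    SqlDataRecord dataRecord = new SqlDataRecord(tvpSchema);

    int rowsSent = 0;
    // stop at the batch boundary; the caller re-invokes with the same reader
    while (rowsSent < batchSize && !fileReader.EndOfStream)
    {
        dataRecord.SetString(0, fileReader.ReadLine());
        rowsSent++;
        yield return dataRecord;
    }
}

public static void ImportInBatches()
{
    using (StreamReader fileReader = new StreamReader("{filePath}"))
    using (SqlConnection connection = new SqlConnection("{connection string}"))
    {
        connection.Open();
        while (!fileReader.EndOfStream) // one Stored Procedure call per batch
        {
            using (SqlCommand command = new SqlCommand("ImportData", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                SqlParameter tvpParam = command.Parameters.Add("@ImportTable", SqlDbType.Structured);
                tvpParam.TypeName = "dbo.ImportStructure";
                tvpParam.Value = GetBatchContents(fileReader, 100000);
                command.ExecuteNonQuery();
            }
        }
    }
}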
Solution 2:
In C#, the best solution is to let SqlBulkCopy read the file. To do this you need to pass an IDataReader directly to the SqlBulkCopy.WriteToServer method. Here is an example: http://www.codeproject.com/Articles/228332/IDataReader-implementation-plus-SqlBulkCopy
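To illustrate the shape of that approach (a sketch of mine, not the linked article's code): for a single-column copy, only a handful of IDataReader members are typically exercised by SqlBulkCopy; the rest below throw so that any member a given provider version does need would surface immediately. The class name and placeholders are assumptions:
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

public class LineFileReader : IDataReader
{
    private readonly StreamReader _file;
    private string _line;

    public LineFileReader(string path) { _file = new StreamReader(path); }

    // SqlBulkCopy calls Read() once per row, then pulls the column value.
    public bool Read() { return (_line = _file.ReadLine()) != null; }
    public int FieldCount { get { return 1; } }
    public object GetValue(int i) { return _line; }
    public bool IsDBNull(int i) { return _line == null; }
    public string GetName(int i) { return "DatasField"; }
    public int GetOrdinal(string name) { return 0; }

    public void Close() { _file.Close(); }
    public void Dispose() { _file.Dispose(); }
    public bool IsClosed { get { return false; } }
    public int Depth { get { return 0; } }
    public int RecordsAffected { get { return -1; } }
    public bool NextResult() { return false; }

    // Members below are not expected to be needed for this single-column copy.
    public DataTable GetSchemaTable() { throw new NotImplementedException(); }
    public object this[int i] { get { throw new NotImplementedException(); } }
    public object this[string name] { get { throw new NotImplementedException(); } }
    public string GetDataTypeName(int i) { throw new NotImplementedException(); }
    public Type GetFieldType(int i) { throw new NotImplementedException(); }
    public int GetValues(object[] values) { throw new NotImplementedException(); }
    public bool GetBoolean(int i) { throw new NotImplementedException(); }
    public byte GetByte(int i) { throw new NotImplementedException(); }
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) { throw new NotImplementedException(); }
    public char GetChar(int i) { throw new NotImplementedException(); }
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) { throw new NotImplementedException(); }
    public Guid GetGuid(int i) { throw new NotImplementedException(); }
    public short GetInt16(int i) { throw new NotImplementedException(); }
    public int GetInt32(int i) { throw new NotImplementedException(); }
    public long GetInt64(int i) { throw new NotImplementedException(); }
    public float GetFloat(int i) { throw new NotImplementedException(); }
    public double GetDouble(int i) { throw new NotImplementedException(); }
    public string GetString(int i) { throw new NotImplementedException(); }
    public decimal GetDecimal(int i) { throw new NotImplementedException(); }
    public DateTime GetDateTime(int i) { throw new NotImplementedException(); }
    public IDataReader GetData(int i) { throw new NotImplementedException(); }
}
The bulk copy itself is then short:
using (SqlConnection connection = new SqlConnection("{connection string}"))
using (LineFileReader reader = new LineFileReader("{filePath}"))
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
    connection.Open();
    bulkCopy.DestinationTableName = "dbo.DATAs";
    bulkCopy.BatchSize = 10000;   // commit every 10k rows instead of one huge transaction
    bulkCopy.BulkCopyTimeout = 0; // disable the 30-second default timeout for a 10M-row load
    bulkCopy.WriteToServer(reader);
}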
Solution 3:
The best way is a mix between your 1st solution and the 2nd: create a DataTable, add rows to it in the loop, and then use SqlBulkCopy to upload it to the DB in one connection; use this for help with bulk copy.
One other thing to pay attention to is that bulk copy is a very sensitive operation, where almost any mistake will void the copy; for example, if you declare the column name in the DataTable as "text" and in the DB it is "Text", it will throw an exception. Good luck.
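A rough sketch of that mix (my illustration; the 100,000-row batch size and the placeholders are assumptions, while the table and column names match the question) could look like this:
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static void LoadInBatches()
{
    using (SqlConnection connection = new SqlConnection("{connection string}"))
    using (StreamReader file = new StreamReader(@"c:\test.txt"))
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        connection.Open();
        bulkCopy.DestinationTableName = "dbo.DATAs";
        bulkCopy.BulkCopyTimeout = 0;

        DataTable table = new DataTable();
        // Column name must match the destination column exactly, case included.
        table.Columns.Add("DatasField", typeof(string));

        string line;
        while ((line = file.ReadLine()) != null)
        {
            table.Rows.Add(line);
            if (table.Rows.Count == 100000) // flush a full batch to keep memory bounded
            {
                bulkCopy.WriteToServer(table);
                table.Clear();
            }
        }
        if (table.Rows.Count > 0) // flush the final partial batch
        {
            bulkCopy.WriteToServer(table);
        }
    }
}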