How can I insert 10 million records in the shortest time possible?

Question

I have a file (which has 10 million records) like below:

    line1
    line2
    line3
    line4
   .......
    ......
    10 million lines

So basically I want to insert 10 million records into the database. so I read the file and upload it to SQL Server.

C# code:

string line;
System.IO.StreamReader file =
    new System.IO.StreamReader(@"c:\test.txt");
while ((line = file.ReadLine()) != null)
{
    // insertion code goes here
    //DAL.ExecuteSql("insert into table1 values("+line+")");
}

file.Close();

but insertion will take a long time. How can I insert 10 million records in the shortest time possible using C#?

UPDATE 1:
BULK INSERT:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
     ROWTERMINATOR = '\n'
);

My table is as follows:

DATAs
(
     DatasField VARCHAR(MAX)
)

But I am getting the following errors:

Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.

Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.

Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

The following code worked:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
);
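As a side note beyond what the question's update shows, BULK INSERT also accepts performance-related options that can matter at the 10-million-row scale. Below is a sketch only: it reuses the table and file path from above, and the added TABLOCK and BATCHSIZE hints are documented BULK INSERT options that were not part of the original statement.

```sql
-- Sketch: same table and file as above, with two performance options
-- added (not part of the original statement).
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n',
    TABLOCK,            -- take a bulk-update table lock for the load
    BATCHSIZE = 100000  -- commit in 100k-row batches rather than one transaction
);
```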

Answer

Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.

Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.

A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.

SQL objects needed:

-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO

-- Second: Use the UDTT as an input param to an import proc.
--         Hence "Table-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
   @ImportTable    dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT  Field
    FROM    @ImportTable;

GO

C# app code to make use of the above SQL objects is below. Notice how rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
   SqlMetaData[] _TvpSchema = new SqlMetaData[] {
      new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
   };
   SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
   StreamReader _FileReader = null;

   try
   {
      _FileReader = new StreamReader("{filePath}");

      // read a row, send a row
      while (!_FileReader.EndOfStream)
      {
         // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
         // SQL Server already received the row when "yield return" was called.
         // Unlike BCP and BULK INSERT, you have the option here to create a string
         // call ReadLine() into the string, do manipulation(s) / validation(s) on
         // the string, then pass that string into SetString() or discard if invalid.
         _DataRecord.SetString(0, _FileReader.ReadLine());
         yield return _DataRecord;
      }
   }
   finally
   {
      // Guard against the StreamReader constructor having thrown
      if (_FileReader != null)
      {
         _FileReader.Close();
      }
   }
}

The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:

public static void test()
{
   SqlConnection _Connection = new SqlConnection("{connection string}");
   SqlCommand _Command = new SqlCommand("ImportData", _Connection);
   _Command.CommandType = CommandType.StoredProcedure;

   SqlParameter _TVParam = new SqlParameter();
   _TVParam.ParameterName = "@ImportTable";
   _TVParam.TypeName = "dbo.ImportStructure";
   _TVParam.SqlDbType = SqlDbType.Structured;
   _TVParam.Value = GetFileContents(); // return value of the method is streamed data
   _Command.Parameters.Add(_TVParam);

   try
   {
      _Connection.Open();

      _Command.ExecuteNonQuery();
   }
   finally
   {
      _Connection.Close();
   }

   return;
}

Additional notes:

  1. With some modification, the above C# code can be adapted to batch the data in.
  2. With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
  3. You can also manipulate the value of each record in the SELECT statement in the proc.
  4. You can also filter out rows by using a WHERE condition in the proc.
  5. You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
  6. Advantages over SqlBulkCopy:
  1. SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
  2. Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
  3. You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both since multiple results sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
  4. For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
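To make advantage 3 above concrete, here is a hypothetical variation of the import proc. It assumes dbo.DATAs also has an IDENTITY column named ID, which is not in the table definition shown in the question; the proc name is likewise invented for illustration.

```sql
-- Hypothetical sketch: assumes dbo.DATAs has an IDENTITY column named ID
-- (the question's table only shows DatasField). On the client, call this
-- proc with ExecuteReader instead of ExecuteNonQuery to read the
-- generated IDs back while the rows stream in.
CREATE PROCEDURE dbo.ImportDataWithIds (
   @ImportTable    dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

INSERT INTO dbo.DATAs (DatasField)
    OUTPUT INSERTED.[ID], INSERTED.DatasField  -- result set sent to the caller
    SELECT  Field
    FROM    @ImportTable;
GO
```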
