How can I insert 10 million records in the shortest time possible?


Problem description


I have a file (which has 10 million records) like below:

    line1
    line2
    line3
    line4
   .......
    ......
    10 million lines

So basically I want to insert 10 million records into the database. So I read the file and upload it to SQL Server.

C# code

string line;
System.IO.StreamReader file =
    new System.IO.StreamReader(@"c:\test.txt");
while ((line = file.ReadLine()) != null)
{
    // insertion code goes here
    //DAL.ExecuteSql("insert into table1 values(" + line + ")");
}

file.Close();

but insertion will take a long time. How can I insert 10 million records in the shortest time possible using C#?

Update 1:
BULK INSERT:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    ROWTERMINATOR = '\n'
);

My Table is like below:

DATAs
(
     DatasField VARCHAR(MAX)
)

but I am getting following error:

Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.

Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.

Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

The code below worked:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
);

Solution

Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.

Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.

A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.

Required database objects:

-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO

-- Second: Use the UDTT as an input param to an import proc.
--         Hence "Tabled-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
   @ImportTable    dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT  Field
    FROM    @ImportTable;

GO

C# app code to make use of the above SQL objects is below. Notice how rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
   SqlMetaData[] _TvpSchema = new SqlMetaData[] {
      new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
   };
   SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
   StreamReader _FileReader = null;

   try
   {
      _FileReader = new StreamReader("{filePath}");

      // read a row, send a row
      while (!_FileReader.EndOfStream)
      {
         // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
         // SQL Server already received the row when "yield return" was called.
         // Unlike BCP and BULK INSERT, you have the option here to create a string,
         // call ReadLine() into that string, do manipulation(s) / validation(s) on
         // the string, and then pass it into SetString(), or discard it if invalid.
         _DataRecord.SetString(0, _FileReader.ReadLine());
         yield return _DataRecord;
      }
   }
   finally
   {
      if (_FileReader != null)
      {
         _FileReader.Close();
      }
   }
}

The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:

public static void test()
{
   SqlConnection _Connection = new SqlConnection("{connection string}");
   SqlCommand _Command = new SqlCommand("ImportData", _Connection);
   _Command.CommandType = CommandType.StoredProcedure;

   SqlParameter _TVParam = new SqlParameter();
   _TVParam.ParameterName = "@ImportTable";
   _TVParam.TypeName = "dbo.ImportStructure";
   _TVParam.SqlDbType = SqlDbType.Structured;
   _TVParam.Value = GetFileContents(); // return value of the method is streamed data
   _Command.Parameters.Add(_TVParam);

   try
   {
      _Connection.Open();

      _Command.ExecuteNonQuery();
   }
   finally
   {
      _Connection.Close();
   }

   return;
}
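If committing all 10 million rows in a single call is undesirable, the streaming method above can be adapted to send the data in batches. The sketch below is one possible approach, not part of the original answer: the caller owns the StreamReader and passes it in, and the iterator stops after a fixed number of rows, so each execution of the proc sends one batch and resumes where the file left off. The method and parameter names here are assumptions, and the TRUNCATE in the proc would have to be removed first or each batch would wipe the previous one.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Server;

// Sketch only: yields at most batchSize rows per enumeration. Because the
// StreamReader is owned by the caller, each new enumeration continues from
// the current position in the file rather than reopening it.
private static IEnumerable<SqlDataRecord> GetFileBatch(
    StreamReader fileReader, int batchSize)
{
   SqlMetaData[] _TvpSchema = new SqlMetaData[] {
      new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
   };
   SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);

   int _RowsSent = 0;
   while (_RowsSent < batchSize && !fileReader.EndOfStream)
   {
      _DataRecord.SetString(0, fileReader.ReadLine());
      _RowsSent++;
      yield return _DataRecord;
   }
}
```

The caller would then loop while `!_FileReader.EndOfStream`, assigning a fresh `GetFileBatch(_FileReader, 10000)` to `_TVParam.Value` and calling `ExecuteNonQuery` once per batch.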

Additional notes:

  1. With some modification, the above C# code can be adapted to batch the data in.
  2. With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
  3. You can also manipulate the value of each record in the SELECT statement in the proc.
  4. You can also filter out rows by using a WHERE condition in the proc.
  5. You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
  6. Advantages over SqlBulkCopy:

    1. SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
    2. Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
    3. You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both, since multiple result sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy, yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
    4. For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
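As a rough illustration of notes 3, 4, and 6.3 combined, here is one possible variant of the import proc. It reuses the object names from the example above, but the ID INT IDENTITY column is an assumption that is not part of the original DATAs table, and the specific manipulation and filter shown are arbitrary examples:

```sql
-- Sketch only: assumes dbo.DATAs has been given an ID INT IDENTITY column.
CREATE PROCEDURE dbo.ImportDataFiltered (
   @ImportTable    dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

INSERT INTO dbo.DATAs (DatasField)
    OUTPUT INSERTED.[ID]           -- note 6.3: pass new IDENTITY values back
    SELECT  LTRIM(RTRIM(Field))    -- note 3: manipulate each value
    FROM    @ImportTable
    WHERE   Field <> '';           -- note 4: filter out unwanted rows
GO
```

The C# side would call this variant via ExecuteReader rather than ExecuteNonQuery in order to read the returned ID values.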
