如何使用 SSIS 将大型平面文件加载到数据库表中? [英] How can I load a large flat file into a database table using SSIS?

查看:27
本文介绍了如何使用 SSIS 将大型平面文件加载到数据库表中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不确定它是如何工作的,所以我正在寻找正确的解决方案.我认为 SSIS 是正确的方法,但我以前从未使用过

I'm not sure how it works so I'm looking for the right solution. I think SSIS is the right way to go but I have never used it before

每天早上,我都会收到一个包含 80 万条记录的制表符分隔文件.我需要将它加载到我的数据库中:

Every morning, I get a tab delimited file with 800K records. I need to load it into my database:

  1. 从 ftp 或本地获取文件
  2. 首先,我需要从数据库中删除新文件中不存在的文件;
    • 如何比较 tsql 中的数据
    • 我应该在哪里加载制表符分隔文件中的数据以便与文件进行比较?我应该使用临时表吗?ItemID 是表中的唯一列.
  1. Get file from ftp or local
  2. First, I need to delete the one which not exists in new file from database;
    • How can I compare data in tsql
    • Where should I load data from tab delimited file in order to compare it with the file? Should I use a temp table? ItemID is the unique column in the table.

不要忘记该文件包含 800K 条记录.

Don't forget that the file contains 800K records.

ID  ItemID  ItemName  ItemType
--  ------  --------  --------
 1  2345    Apple     Fruit
 2  4578    Banana    Fruit

我该如何解决这个问题?

How can I approach this problem?

推荐答案

是的,SSIS 可以执行您在问题中指定的要求.以下示例应该让您了解如何完成.示例使用 SQL Server 作为后端.下面提供了对封装执行的一些基本测试场景.抱歉回复太长.

Yes, SSIS can perform the requirements that you have specified in the question. Following example should give you an idea of how it can be done. Example uses SQL Server as the back-end. Some of the basic test scenarios performed on the package are provided below. Sorry for the lengthy answer.

分步过程:

  1. 在 SQL Server 数据库中,创建两个表,分别是 dbo.ItemInfodbo.Staging.脚本 部分下提供了创建表查询.这些表的结构显示在屏幕截图 #1 中.ItemInfo 将保存实际数据,Staging 表将保存临时数据以比较和更新实际记录.这两个表中的 Id 列都是自动生成的唯一标识列.ItemInfo 表中的IsProcessed 列将用于识别和删除不再有效的记录.

  1. In the SQL Server database, create two tables namely dbo.ItemInfo and dbo.Staging. Create table queries are available under Scripts section. Structure of these tables are shown in screenshot #1. ItemInfo will hold the actual data and Staging table will hold the staging data to compare and update the actual records. Id column in both these tables is an auto-generated unique identity column. IsProcessed column in the table ItemInfo will be used to identify and delete the records that are no longer valid.

创建一个 SSIS 包并创建 5 个变量,如屏幕截图 #2 所示.我为制表符分隔的文件使用了 .txt 扩展名,因此变量 FileExtension 中的值是 *.txt.FilePath 变量将在运行时赋值.FolderLocation 变量表示文件所在的位置.SQLPostLoadSQLPreLoad 变量表示在预加载和后加载操作期间使用的存储过程.脚本 部分下提供了这些存储过程的脚本.

Create an SSIS package and create 5 variables as shown in screenshot #2. I have used .txt extension for the tab delimited files and hence the value *.txt in the variable FileExtension. FilePath variable will be assigned with value during run-time. FolderLocation variable denotes where the files will be located. SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.

创建指向 SQL Server 数据库的 OLE DB 连接.创建平面文件连接,如屏幕截图 #3 和 #4 所示.平面文件连接列部分包含列级信息.屏幕截图 #5 显示列数据预览.

Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. Flat File Connection Columns section contains column level information. Screenshot #5 shows the columns data preview.

配置控制流任务,如屏幕截图 #6 所示.配置任务Pre LoadPost LoadLoop Files,如截图#7 - #10.Pre Load 将截断临时表并将 ItemInfo 表中所有行的 IsProcessed 标志设置为 false.Post Load 将更新更改并删除数据库中未在文件中找到的行.请参阅这些任务中使用的存储过程以了解这些 Execute SQL 任务中正在执行的操作.

Configure the Control Flow Task as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load will truncate staging table and set IsProcessed flag to false for all rows in ItemInfo table. Post Load will update the changes and will delete rows in database that are not found in the file. Refer the stored procedures used in those tasks to understand what is being done in these Execute SQL tasks.

双击 Load Items 数据流任务并按屏幕截图 #11 所示对其进行配置.Read File 是配置为使用平面文件连接的平面文件源.Row Count 是派生列转换,其配置显示在 screenshto #12 中.Check Exist 是一种查找转换,其配置显示在屏幕截图 #13 - #15 中.Lookup No Match Output 被重定向到左侧的 Destination Split.查找匹配输出被重定向到左侧的Staging Split.Destination SplitStaging Split 具有与屏幕截图 #16 中所示完全相同的配置.目的地和登台表有 9 个不同目的地的原因是为了提高包的性能.

Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is derived column transformation and its configuration is shown in screenshto #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. Lookup No Match Output is redirected to Destination Split on the left side. Lookup Match Output is redirected to Staging Split on the left side. Destination Split and Staging Split have the exact same configuration as shown in screenshot #16. The reason for 9 different destinations for both destination and staging table is to improve the performance of the package.

所有目标任务 0 - 8 都配置为将数据插入表 dbo.ItemInfo,如屏幕截图 #17 所示.所有暂存任务 0 - 8 都配置为将数据插入 dbo.Staging,如屏幕截图 #18 所示.

All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.

在平面文件连接管理器上,将 ConnectionString 属性设置为使用变量 FilePath,如屏幕截图 #19 所示.这将使包能够在循环遍历文件夹中的每个文件时使用变量中设置的值.

On the Flat File connection manager, set the ConnectionString property to use the variable FilePath as shown in screenshot #19. This will enable the package to use the value set in the variable as it loops through each file in a folder.

测试场景:

Test results may vary from machine to machine. 
In this scenario, file was located locally on the machine. 
Files on network might perform slower. 
This is provided just to give you an idea. 
So, please take these results with grain of salt.

  1. 程序包是在配备至强单核 CPU 2.5GHz 和 3.00 GB RAM 的 64 位机器上执行的.

  1. Package was executed on a 64-bit machine with Xeon single core CPU 2.5GHz and 3.00 GB RAM.

加载了一个包含 100 万行 的平面文件.包在大约 2 分 47 秒内执行.请参阅屏幕截图 #20 和 #21.

Loaded a flat file with 1 million rows. Package executed in about 2 mins 47 seconds. Refer screenshots #20 and #21.

使用测试查询部分下提供的查询来修改数据,以模拟在包的第二次运行期间更新、删除和创建新记录.

Used the queries provided under Test queries section to modify the data to simulate update, delete and creation of new records during the second run of the package.

在数据库中执行以下查询后加载包含 100 万行 的相同文件.包在大约 1 分 35 秒内执行.请参阅屏幕截图 #22 和 #23.请注意屏幕截图 #22 中重定向到目标和暂存表的行数.

Loaded the same file containing the 1 million rows after the following queries were executed in the database. Package executed in about 1 min 35 seconds. Refer screenshots #22 and #23. Please note the number of rows redirected to destination and staging table in screenshot #22.

希望有所帮助.

测试查询:.

--These records will be deleted during next run 
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

--These records will be modified to their original item type of 'General'
--because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)

平面文件连接列.

Name        InputColumnWidth     DataType          OutputColumnWidth
----------  ----------------     ---------------   -----------------
Id          8                    string [DT_STR]   8
ItemId      11                   string [DT_STR]   11
ItemName    21                   string [DT_STR]   21
ItemType    9                    string [DT_STR]   9

脚本:(创建表和存储过程).

CREATE TABLE [dbo].[ItemInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
    [IsProcessed] [bit] NULL,
    CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Staging](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
 CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE      ITM
    SET         ITM.ItemName    = STG.ItemName
            ,   ITM.ItemType    = STG.ItemType 
            ,   ITM.IsProcessed = 1
    FROM        dbo.ItemInfo    ITM
    INNER JOIN  dbo.Staging     STG
    ON          ITM.ItemId      = STG.ItemId;

    DELETE FROM dbo.ItemInfo
    WHERE       IsProcessed = 0;
END
GO

CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE dbo.Staging;     

    UPDATE  dbo.ItemInfo 
    SET     IsProcessed = 0;
END
GO

屏幕截图 #1:

屏幕截图 #2:

屏幕截图 #3:

屏幕截图 #4:

屏幕截图 #5:

屏幕截图 #6:

屏幕截图 #7:

截图 #8:

屏幕截图 #9:

屏幕截图 #10:

屏幕截图 #11:

屏幕截图 #12:

屏幕截图 #13:

屏幕截图 #14:

屏幕截图 #15:

屏幕截图 #16:

屏幕截图 #17:

屏幕截图 #18:

屏幕截图 #19:

屏幕截图 #20:

屏幕截图 #21:

屏幕截图 #22:

屏幕截图 #23:

这篇关于如何使用 SSIS 将大型平面文件加载到数据库表中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆