SSIS: Flat File Source to SQL without Duplicate Rows

Question

I have a fairly large flat file (CSV) that I am trying to import into a SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.

Example data:

Item Number    |    Item Name     |     Update Date
ITEM-01        | First Item       | 1-Jan-2013
ITEM-01        | First Item       | 5-Jan-2013
ITEM-24        | Another Item     | 12-Mar-2012
ITEM-24        | Another Item     | 13-Mar-2012
ITEM-24        | Another Item     | 14-Mar-2012

Now I need to create my master item table from this data. As you can see, the rows are duplicated because of the Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check whether the next item number equals the previous one and, if so, NOT import that line.

I tried the Sort transformation with Remove Duplicates in the SSIS package, but it sorts all the lines, which is pointless because they are already sorted. It also takes forever on this many lines.

Is there any other way?

Recommended answer

There are a couple of approaches you can take to do this.

The first is an Aggregate Transformation. Group by Item Number and Item Name, then perform an aggregate operation on Update Date. Based on the logic you described, the Minimum operation should work. To use Minimum, you need to convert the Update Date column to a date first (Minimum cannot be performed on a string); that conversion can be done in a Data Conversion Transformation placed before the Aggregate.
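
To make the grouping logic concrete, here is a minimal sketch in plain C# (not SSIS code) of what this approach computes: parse the date text, group by item number and item name, and keep the minimum date per group. The `ItemRow` and `AggregateSketch` names are hypothetical and exist only for illustration, and the `d-MMM-yyyy` date format is an assumption based on the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical row type for illustration; not part of the SSIS object model.
public sealed class ItemRow
{
    public string ItemNumber { get; }
    public string ItemName { get; }
    public DateTime UpdateDate { get; }

    public ItemRow(string itemNumber, string itemName, DateTime updateDate)
    {
        ItemNumber = itemNumber;
        ItemName = itemName;
        UpdateDate = updateDate;
    }
}

public static class AggregateSketch
{
    // Converts text such as "1-Jan-2013" to a DateTime. This plays the role
    // of the Data Conversion step; the format string is an assumption.
    public static DateTime ParseDate(string text)
    {
        return DateTime.ParseExact(text.Trim(), "d-MMM-yyyy", CultureInfo.InvariantCulture);
    }

    // Groups by (ItemNumber, ItemName) and keeps the minimum UpdateDate per group.
    public static List<ItemRow> MinDatePerItem(IEnumerable<ItemRow> rows)
    {
        return rows
            .GroupBy(r => (r.ItemNumber, r.ItemName))
            .Select(g => new ItemRow(g.Key.ItemNumber, g.Key.ItemName, g.Min(r => r.UpdateDate)))
            .ToList();
    }

    public static void Main()
    {
        var rows = new List<ItemRow>
        {
            new ItemRow("ITEM-01", "First Item", ParseDate("1-Jan-2013")),
            new ItemRow("ITEM-01", "First Item", ParseDate("5-Jan-2013")),
            new ItemRow("ITEM-24", "Another Item", ParseDate("12-Mar-2012")),
            new ItemRow("ITEM-24", "Another Item", ParseDate("13-Mar-2012")),
            new ItemRow("ITEM-24", "Another Item", ParseDate("14-Mar-2012")),
        };

        foreach (var r in MinDatePerItem(rows))
        {
            Console.WriteLine(r.ItemNumber + " | " + r.ItemName + " | " +
                r.UpdateDate.ToString("d-MMM-yyyy", CultureInfo.InvariantCulture));
        }
    }
}
```

On the sample data this yields one row per item, carrying the earliest Update Date of each group. Note that grouping has to see every row before it can emit results, whereas the sorted-input check the question describes can work one row at a time.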

The second is a Script Component. Essentially, you could implement the logic you mentioned above:

"if next item number = previous item number then do NOT import this line"

First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):

  1. Select Transformation as the Script Component type.
  2. Add the Script Component after the Flat File Source in your Data Flow.

Under Input Columns, select all columns.

Under Inputs and Outputs, select Output 0 and set its SynchronousInputID property to None. This makes the output asynchronous, so the script decides which rows to emit.

Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types).

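Finally, add the following to the script's ScriptMain class:

    // Item number of the most recently processed row; empty before the first row.
    private string previousItemNumber = string.Empty;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // The file is sorted by item number, so a row starts a new item
        // exactly when its item number differs from the previous row's.
        if (!Row.ItemNumber.Equals(previousItemNumber))
        {
            // Emit only the first row of each item to the asynchronous output.
            Output0Buffer.AddRow();
            Output0Buffer.ItemName = Row.ItemName;
            Output0Buffer.ItemNumber = Row.ItemNumber;
            Output0Buffer.UpdateDate = Row.UpdateDate;
        }

        previousItemNumber = Row.ItemNumber;
    }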
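
To try the previous-row comparison outside SSIS, the same idea can be sketched as a small standalone C# program. This is only an illustration under stated assumptions: `SortedDedupSketch` and `FirstRowPerKey` are hypothetical names, rows are modelled as string arrays with the item number in the first element, and the input is assumed to be sorted by item number as the question guarantees.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedDedupSketch
{
    // Streams through rows that are already sorted by key (element 0) and
    // yields only the first row of each run of equal keys. No sorting or
    // buffering of the whole input is needed.
    public static IEnumerable<string[]> FirstRowPerKey(IEnumerable<string[]> rows)
    {
        string previousKey = null; // null means "no row seen yet"
        foreach (var row in rows)
        {
            if (!string.Equals(row[0], previousKey, StringComparison.Ordinal))
            {
                yield return row;
            }
            previousKey = row[0];
        }
    }

    public static void Main()
    {
        // Sample lines from the question, pipe-delimited for this sketch.
        var lines = new[]
        {
            "ITEM-01|First Item|1-Jan-2013",
            "ITEM-01|First Item|5-Jan-2013",
            "ITEM-24|Another Item|12-Mar-2012",
            "ITEM-24|Another Item|13-Mar-2012",
            "ITEM-24|Another Item|14-Mar-2012",
        };

        var rows = lines.Select(line => line.Split('|'));
        foreach (var row in FirstRowPerKey(rows))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Because only the previous key is remembered, memory use stays constant regardless of file size, which is the advantage over sorting the whole file. The check relies entirely on the sort order: if equal item numbers were not adjacent, duplicates would slip through.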
