在 SSIS 中查找更新的记录——散列还是不散列? [英] Finding updated records in SSIS -- to hash or not to hash?

查看:25
本文介绍了在 SSIS 中查找更新的记录——散列还是不散列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 SSIS 将数据从 DB2 数据库中的表迁移到我们的 SQL Server 数据库.我从中提取数据的表包含相当数量的数据——略少于 100,000 条记录;但是,它也有 46 列.

I'm working on migrating data from a table in a DB2 database to our SQL Server database using SSIS. The table that I am pulling data from contains a respectable amount of data--a little less than 100,000 records; but, it also has 46 columns.

我只想更新需要更新的行,因此我得出的结论是,我可以使用查找转换并检查所有 46 列并重定向不匹配"以在 SQL 表上更新.或者,我可以在我的数据任务流开始时读取数据后对数据集中的每一行进行散列,然后,随后在确定行是否相等时使用散列值作为比较.

I only want to update the rows that NEED to be updated, and so I came to conclusion that I could either use a Lookup Transformation and check all 46 columns and redirect the "no matches" to be updated on the SQL table. Or, I could hash each row in the datasets after I read the data in at the beginning of my data task flow, and then, subsequently, use the hash values as a comparison later on when determining if the rows are equal or not.

我的问题是:走哪条路更好?我喜欢散列它们,但我不确定这是否是最佳途径.有没有人有任何智慧的珍珠想要分享?

My question would be: Which is the better route to take? I like hashing them, but I'm not sure if that is the best route to take. Does anyone have any pearls of wisdom they'd like to share?

推荐答案

为什么不能两者兼而有之?

Why not both?

一般来说,在进行增量加载时,我们会寻找两件事:这是否存在?如果存在,是否已更改.如果只有一列,那就微不足道了.当有很多列需​​要检查时,这会变得非常痛苦,尤其是当您使用 SSIS 来映射所有这些列和/或必须处理对 NULL 的担忧时.

Generally speaking, there are two things we look for when doing an incremental load: Does this exist? If it exists, has it changed. If there's a single column, it's trivial. When there are many columns to check, that becomes quite the pain, especially if you're using SSIS to map all those columns and/or have to deal with worrying about NULLs.

我通过作弊解决了多列问题 - 我在所有表中创建了两列:HistoricalHashKey 和 ChangeHashKey.历史哈希键将是所有业务键.更改哈希键是所有其余的材料列(我会排除审计列之类的内容).我们没有将串联的值直接存储在我们的哈希列中.相反,我们将把它算出来"并应用一种称为 SHA-1 的散列算法.该算法将获取所有输入列并返回一个 20 字节的输出.

I solve the multicolumn problem by cheating - I create two columns in all my tables: HistoricalHashKey and ChangeHashKey. Historical hash key will be all the business keys. Change hash key is all the rest of the material columns (I'd exclude things like audit columns). We are not storing the concatenated values directly in our hash columns. Instead, "we're going Math the stuff out of it" and apply a hashing algorithm called SHA-1. This algorithm will take all the input columns and return a 20 byte output.

使用这种方法有三个注意事项.您必须每次都以相同的顺序连接列.这些将区分大小写.尾随空间很重要.就是这样.

There are three caveats to using this approach. You must concatenate the columns in the same order every time. These will be case sensitive. Trailing space is significant. That's it.

在您的表中,您可以将这两列添加为 binary(20) NOT NULL.

In your tables, you would add those the two columns as binary(20) NOT NULL.

你的控制流看起来像这样

Your control flow would look something like this

你的数据流是这样的

(假设我来自 Adventureworks2014,Production.Product)我将使用 SQL Server 2012+ 中的 CONCAT 函数,因为它将所有数据类型提升为字符串并且是 NULL 安全的.

(Assume I'm sourced from Adventureworks2014, Production.Product) I'm going to use the CONCAT function from SQL Server 2012+ as it promotes all data types to string and is NULL safe.

SELECT
    P.ProductID
,   P.Name
,   P.ProductNumber
,   P.MakeFlag
,   P.FinishedGoodsFlag
,   P.Color
,   P.SafetyStockLevel
,   P.ReorderPoint
,   P.StandardCost
,   P.ListPrice
,   P.Size
,   P.SizeUnitMeasureCode
,   P.WeightUnitMeasureCode
,   P.Weight
,   P.DaysToManufacture
,   P.ProductLine
,   P.Class
,   P.Style
,   P.ProductSubcategoryID
,   P.ProductModelID
,   P.SellStartDate
,   P.SellEndDate
,   P.DiscontinuedDate

,   P.rowguid
,   P.ModifiedDate

-- Hash my business key(s)
,   CONVERT(binary(20), HASHBYTES('MD5',
            CONCAT
            (
            -- Having an empty string as the first argument
            -- allows me to simplify building of column list
                ''
            ,   P.ProductID
            )
        )
    ) AS HistoricalHashKey

-- Hash the remaining columns
,   CONVERT(binary(20), HASHBYTES('MD5',
            CONCAT
            (      
                ''
            ,   P.Name
            ,   P.ProductNumber
            ,   P.MakeFlag
            ,   P.FinishedGoodsFlag
            ,   P.Color
            ,   P.SafetyStockLevel
            ,   P.ReorderPoint
            ,   P.StandardCost
            ,   P.ListPrice
            ,   P.Size
            ,   P.SizeUnitMeasureCode
            ,   P.WeightUnitMeasureCode
            ,   P.Weight
            ,   P.DaysToManufacture
            ,   P.ProductLine
            ,   P.Class
            ,   P.Style
            ,   P.ProductSubcategoryID
            ,   P.ProductModelID
            ,   P.SellStartDate
            ,   P.SellEndDate
            ,   P.DiscontinuedDate
            )
        )
    ) AS ChangeHashKey
FROM
    Production.Product AS P;

LKP 检查存在

此查询将从我们的参考表中拉回存储的 HistoricalHashKey 和 ChangeHashKey.

LKP Check Existence

This query will pull back the stored HistoricalHashKey and ChangeHashKey from our reference table.

SELECT
    DP.HistoricalHashKey
,   DP.ChangeHashKey
FROM
    dbo.DimProduct AS DP;

此时,通过比较HistoricalHashKeys来判断该行是否存在就很简单了.如果匹配,我们希望将 ChangeHashKey 拉回到我们的数据流中.按照惯例,我将此命名为 lkp_ChangeHashKey 以区别于源 ChangeHashKey.

At this point, it's a simple matter to compare the HistoricalHashKeys to determine whether the row exists. If we match, we want to pull back the ChangeHashKey into our Data Flow. By convention, I name this lkp_ChangeHashKey to differentiate from the source ChangeHashKey.

条件拆分也得到了简化.两个 Change Hash 键匹配(无变化)或不匹配(变化).该表达式将是

The conditional split is also simplified. Either the two Change Hash keys match (no change) or they don’t (changed). That expression would be

ChangeHashKey == lkp_ChangeHashKey

OLE_DST 分阶段更新

与其使用 OLE DB 命令,不如创建一个专用表来保存需要更新的行.OLE DB 命令的伸缩性不如它在后台发出单例更新命令的好.

OLE_DST StagedUpdates

Rather than use the OLE DB Command, create a dedicated table for holding the rows that need to be updated. OLE DB Command does not scale well as behind the scenes it issues singleton update commands.

数据流完成后,所有需要更新的数据都会在我们的暂存表中.此执行 SQL 任务只是更新与我们的业务键匹配的现有数据.

After the data flow is complete, all the data that needs updating will be in our staging table. This Execute SQL Task simply updates the existing data matching on our business keys.

UPDATE
    TGT
SET
    Name = SRC.name
,   ProductNumber = SRC.
FROM
    dbo.DimProduct AS TGT
    INNER JOIN
        Stage.DimProduct AS SRC
        ON SRC.HistoricalHashKey = TGT.HistoricalHashKey;
        -- If clustered on a single column and table is large, this will yield better performance
        -- ON SRC.DimProductSK = TGT.DimProductSK;

来自评论

为什么我要使用专用的 INSERTUPDATE 语句,因为我们有闪亮的 MERGE?除了不容易记住语法之外,SQL Server 实现还可以有一些...... 意外后果.它们可能是角落ish 案例,但我宁愿不使用我提供的解决方案遇到它们.显式的 INSERT 和 UPDATE 语句为我提供了解决方案中我想要和需要的细粒度控制.我喜欢 SQL Server,认为它是一个很棒的产品,但它们奇怪的语法加上已知的错误使我无法在认证考试之外的任何地方使用 MERGE.

From the comments

Why do I use dedicated INSERT and UPDATE statements since we have the shiny MERGE? Besides not remembering the syntax as easily, the SQL Server implementation can have some ... unintended consequences. They may be cornerish cases but I'd rather not run into them with the solutions I deliver. Explicit INSERT and UPDATE statements give me the fine grained control I want and need in my solutions. I love SQL Server, think it's a fantastic product but they weird syntax coupled with known bugs keeps me from using MERGE anywhere but a certification exam.

这篇关于在 SSIS 中查找更新的记录——散列还是不散列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆