SSIS 仅提取 Delta 更改 [英] SSIS only extract Delta changes
问题描述
经过一些建议.我正在使用 SSIS\SQL Server 2014.我有一个夜间 SSIS 包,它将非 SQL Server 数据库中的数据提取到一个表中(SQL 表每次都事先被截断),然后我从该表中提取以创建一个每日 csv 文件.
After some advice. I'm using SSIS\SQL Server 2014. I have a nightly SSIS package that pulls in data from non-SQL Server db's into a single table (the SQL table is truncated beforehand each time) and I then extract from this table to create a daily csv file.
展望未来,我只想每天将已更改的记录(即 Delta)提取到 csv.
Going forward, I only want to extract to csv on a daily basis the records that have changed i.e. the Deltas.
最好的方法是什么?我想在 SSIS 中使用 CDC,但是由于我每次在初始加载之前都截断 SQL 表,这是最好的方法吗?或者我是否需要在初始加载时使用 SQL 中的主表,然后导入到另一个表中并只提取不同的地方?有关信息,SQL 中的表包含一个主键.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before the initial load each time, will this be best method? Or will I need to have a master table in SQL with an initial load, then import into another table and just extract where there are different? For info, the table in SQL contains a Primary Key.
我只想仔细检查一下,因为 CDC 假设这些表都在 SQL Server 中,而我的数据首先来自 SQL Server 外部.
I just want to double check as CDC assumes the tables are all in SQL Server, whereas my data is coming from outside SQL Server first.
感谢您的帮助.
推荐答案
该表上的主键是您的救星.很明显,您将不同数据拉入的 SQL Server 数据库不会知道从一个表刷新到下一个哪些记录已更改,但是如果您添加两个附加表,并使用附加列修改现有表,它应该能够通过利用 哈希字节.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.
在本例中,我将新表命名为 SentRows
,但您可以在实践中使用更有意义的名称.我们将调用旧表中的新列 HashValue
.
For this example, I'll call the new table SentRows
, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue
.
将 HashValue
列作为 varbinary
数据类型添加到您的表中.NOT NULL
也一样.
Add the column HashValue
to your table as a varbinary
data type. NOT NULL
as well.
创建您的 SentRows
表,其中包含主表主键中所有列的列,以及 HashValue
列.
Create your SentRows
table with columns for all the columns in the main table's primary key, plus the HashValue
column.
创建一个与主表结构相同的 RowsToSend
表,包括 HashValue
.
Create a RowsToSend
table that's structurally identical to your main table, including the HashValue
.
通过将 HASHBYTES
应用于表中的所有非键列,修改您的查询以创建 HashValue
.(这会非常乏味.抱歉.)
Modify your queries to create the HashValue
by applying HASHBYTES
to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
发送您的完整数据集.
现在将所有键值和 HashValue
移动到 SentRows
表.截断主表.
Now move all of the key values and HashValue
s to the SentRows
table. Truncate your main table.
在下一次拉取时,将 SentRows
中的键值和 HashValue
与主表中的新数据进行比较.
On the next pull, compare the key values and HashValue
s from SentRows
to the new data in the main table.
主键匹配 + 哈希匹配 = 未更改行
Primary key match + hash match = Unchanged row
主键匹配 + 哈希不匹配 = 更新行
Primary key match + hash mismatch = Updated row
传入数据中的主键但现有数据集中缺失=新行
Primary key in incoming data but missing from existing data set = New row
主键不在传入数据中,而是在现有数据集中 = 已删除行
Primary key not in incoming data but in existing data set = Deleted row
提取您需要发送到 RowsToSend
表的任何更改.
Pull out any changes you need to send to the RowsToSend
table.
从 RowsToSend
发送更改.
将键值和 HashValue
移动到您的 SentRows
表中.更新已更改键值的哈希值,插入新行,并决定如何处理删除(如果必须处理删除).
Move the key values and HashValue
s to your SentRows
table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
截断 SentRows
表,为明天做好准备.
Truncate the SentRows
table to get ready for tomorrow.
如果您愿意(如果您愿意,稍后您会感谢自己)将计算列添加到 SentRows
表中,默认为 GETDATE()
,即会告诉您何时添加该行.
If you'd like (and you'll thank yourself later if you do) add a computed column to the SentRows
table with default of GETDATE()
, which will tell you when the row was added.
你走吧.从现在开始,只有增量.
And away you go. Nothing but deltas from now on.
编辑 2019-10-31:
循序渐进(或 TL;DR):
Step by step (or TL;DR):
1) 刷新并填充MainTable
.
2) 将 MainTable
上的键和哈希与 SentRows
上的键和哈希进行比较,以识别新的/更改的行.
2) Compare keys and hashes on MainTable
to keys and hashes on SentRows
to identify new/changed rows.
3) 将新的/更改的行移动到 RowsToSend
.
3) Move new/changed rows to RowsToSend
.
4) 发送 RowsToSend
中的行.
4) Send the rows that are in RowsToSend
.
5) 将所有行从 RowsToSend
移动到 SentRows
.
5) Move all the rows from RowsToSend
to SentRows
.
6) 截断 RowsToSend
.
这篇关于SSIS 仅提取 Delta 更改的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!