将查找转换部分缓存限制为加载的值 [英] Restrict Lookup Transformation Partial Cache to Loaded Values

查看:26
本文介绍了将查找转换部分缓存限制为加载的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 SSIS 将数据从平面文件加载到 SQL 表.平面文件包含新行和更新行.每次进程运行时,更新的行将影响 SQL 表的一小部分,由期间"列指定(例如,一个过程可能只影响期间 3、4 和 5).

I am using SSIS to load data from flat files to a SQL table. The flat files contain both new and updated rows. Each time the process is run, the updated rows will affect a small subset of the SQL table, specified by a 'period' column (e.g. one procedure may only affect periods 3, 4, and 5).

我正在使用查找转换将新行(查找无匹配输出)与现有行 s(查找匹配输出)分开.由于引用集和加载的数据集都非常大,我想使用部分缓存进行查找.是否可以将部分缓存查询修改为仅包含平面文件中包含的期间编号中的行?

I am using a Look-Up transformation to separate new rows (Lookup No Match Output) from existing row s(Lookup Match Output). Since both the reference set and the data set being loaded is extremely large, I would like to use partial caching for the lookup. Is it somehow possible to modify the partial caching query to only include rows from the period numbers included in the flat files?

例如,我的参考表可能包含第 1-10 期的数据,但我加载的平面文件可能只有第 3-5 期的数据.因此,我只想缓存时间段 3-5 的数据,因为我已经知道时间段 1-2 和 6-10 永远不会产生匹配.

For example, my reference table may contain data from periods 1-10, but my flat files being loaded may only have data from periods 3-5. Therefore, I only want to cache data from periods 3-5, since I already know periods 1-2 and 6-10 will never produce a match.

推荐答案

不要在下拉列表中使用表选择器,除非你需要每一行的每一列,否则你不应该这样做,写你的查询只拉回匹配或扩充现有数据所需的列.在您的情况下,您将需要添加一个有点挑剔的过滤器.

Instead of using the table selector in the drop down, which you should never do unless you need every column from every row, write your query to only pull back the columns you need for either matching or augmenting the existing data. In your case, you're going to need to add a filter which is a bit persnickety.

我发现的最佳方法是在 String 类型的变量中编写查找查询.在其中,我将构建查询并应用所需的过滤器.下面,你会看到我定义了两个变量.一个 int 将用作我的过滤器,然后是使用它的查询本身.

The best approach I've found is to write the lookup query in a variable of type String. In it, I will build the query and apply the needed filter. Below, you see I defined two variables. One an int which will serve as my filter and then the query itself which uses it.

我的 SourceQuery 变量上的表达式是

The expression on my SourceQuery Variable is

"SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 
WHERE D.rn <= " + (DT_WSTR, 10) @[User::MaxID]

我的数据流看起来像

我有我的来源,它进行了查找,并根据匹配的结果转到两个存储桶之一.我的源查询只生成数字 1 到 10,而查找是生成从 2 到 20 的偶数的查询.

I have my source and it hits a lookup and based on matched results it goes to one of the two buckets. My source query just generates the numbers 1 to 10 and the lookup is a query that generates even numbers from 2 to 20.

在设计时,该查询看起来像

During design time, that query looks like

SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 

正常运行会导致存储桶之间的分配比例为 50/50

A normal run would result in a 50/50 split between the buckets

当然,我们的目标是让查找查询采用类似源组件之一的参数,但您很快就会发现

The goal of course is to make the lookup query take a parameter like one of the source components but you'd quickly discover that

SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 
WHERE D.rn > ?

不会飞.相反,您必须返回到控制流并选择数据流,右键单击并选择属性.在数据流窗口中,转到表达式并单击省略号 (...)

doesn't fly. Instead, you have to go back out to the Control Flow and select the Data Flow, Right click and select Properties. In the window for your data flow, go to Expressions and click the ellipses (...)

将有一个属性命名为您的查找任务.分配使用表达式的变量使其全部动态化,瞧,maxid 为 6 我只找到 3 个匹配

There will be a property named as your Lookup task. Assign the Variable that uses the expression to make it all dynamic and voila, with a maxid of 6 I only find 3 matches

最后一点,部分缓存可能是您要找的,也可能不是.这是一个实际的查找设置,用于控制它如何平衡查找数据的成本与在本地缓存它的成本.完整缓存会将适用范围的所有指定列放入内存中,这就是您只想指定所需列的原因.如果您可以将其缩减为几个细小的列,即使是数百万行,您也可能不会感到痛苦.

A final note, the Partial Cache may or may not be what you're looking for. That's an actual lookup setting that controls how it balances the cost of lookup data versus caching it locally. A full cache will drop all the specified columns for the applicable range into memory which is why you only want to specify the columns you need. If you can get it down to a few skinny columns and even if it's millions of rows, you probably aren't going to feel pain.

这篇关于将查找转换部分缓存限制为加载的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆