Using SSIS, How do I find the cities with the largest population?


Problem description

I have a dataflow task with information that looks something like this:

Province | City    | Population
-------------------------------
Ontario  | Toronto | 7000000
Ontario  | London  |  300000
Quebec   | Quebec  |  300000
Quebec   | Montreal| 6000000

How do I use the Aggregate transformation to get the city with the largest population in each province:

Province | City    | Population
-------------------------------
Ontario  | Toronto | 7000000
Quebec   | Montreal| 6000000

If I set "Province" as the Group-By column and "Population" to the "Max" aggregate, what do I do with the City column?

Solution

Completely agree with @PaulStock that aggregates are best left to source systems. An aggregate in SSIS is a fully blocking component much like a sort and I've already made my argument on that point.

But there are times when doing those operations in the source system just isn't going to work. The best I've been able to come up with is to basically double process the data. Yes, ick, but I was never able to find a way to pass a column through unaffected. For Min/Max scenarios, I'd want that as an option, but obviously something like a Sum would make it hard for the component to know which "source" row the result should tie to.

2005

A 2005 implementation would look like this. Your performance is not going to be good, in fact a few orders of magnitude from good as you'll have all these blocking transforms in there in addition to having to reprocess your source data.

[Screenshot: Merge join]
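
The logic of that aggregate-then-join pattern can be sketched outside SSIS. This is only an illustration in plain Python, not package code: the tuples stand in for pipeline rows, the variable names are made up for the example, and a set membership test stands in for the Merge Join on (Province, Population).

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (province, city, population).
rows = [
    ("Ontario", "Toronto", 7000000),
    ("Ontario", "London", 300000),
    ("Quebec", "Quebec", 300000),
    ("Quebec", "Montreal", 6000000),
]

# Branch 1: sort by province (a blocking step), then take MAX(population)
# per province, as the Aggregate transformation would.
sorted_rows = sorted(rows, key=itemgetter(0))
max_by_province = [
    (province, max(r[2] for r in group))
    for province, group in groupby(sorted_rows, key=itemgetter(0))
]

# Branch 2: read the source a second time, sort it, and keep only the rows
# whose (province, population) pair appears in the aggregate output.
# A set lookup stands in for the Merge Join here.
detail = sorted(rows, key=lambda r: (r[0], r[2]))
wanted = set(max_by_province)
largest_cities = [r for r in detail if (r[0], r[2]) in wanted]

for row in largest_cities:
    print(row)
```

The City column never passes through the aggregate; it is recovered by joining the aggregate result back to the detail rows, which is why the source gets processed twice.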

2008

In 2008, you have the option of using the Cache Connection Manager which would help eliminate the blocking transformations, at least where it matters, but you're still going to have to pay the cost of double processing your source data.

Drag two data flows onto the canvas. The first will populate the cache connection manager and should be where the aggregate takes place.

Now that the cache has the aggregated data in there, drop a lookup task in your main data flow and perform a lookup against the cache.

[Screenshots: General lookup tab; Select the cache connection manager; Map the appropriate columns; Great success]
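
As a rough sketch of what the two data flows do, here is the same idea in plain Python. It is not SSIS code: a dict plays the role of the cache, `lookup_matches` is a hypothetical helper name, and the match on province plus population mirrors the column mapping in the Lookup.

```python
# Sample rows from the question: (province, city, population).
rows = [
    ("Ontario", "Toronto", 7000000),
    ("Ontario", "London", 300000),
    ("Quebec", "Quebec", 300000),
    ("Quebec", "Montreal", 6000000),
]

# First data flow: aggregate MAX(population) per province and fill the cache.
cache = {}
for province, _city, population in rows:
    if province not in cache or population > cache[province]:
        cache[province] = population


def lookup_matches(source, cache):
    """Second data flow: stream the source again and keep matching rows.

    Each row is checked against the cache one at a time, so nothing in this
    flow has to wait for the whole input before emitting output.
    """
    for province, city, population in source:
        if cache.get(province) == population:
            yield province, city, population


result = list(lookup_matches(rows, cache))
for row in result:
    print(row)
```

The blocking work happens while the cache is being filled; the main flow only does per-row lookups, but the source is still read twice.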

Script task

The third approach I can think of, 2005 or 2008, is to write it yourself. As a general rule, I try to avoid script tasks, but this is a case where it probably makes sense. You will need to make it an asynchronous script transformation, but you simply handle your aggregations in there. It is more code to maintain, but you save yourself the trouble of reprocessing your source data.
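
The aggregation such a script would carry out is small. Here it is sketched in Python purely to show the logic; an actual SSIS script component would be written in VB or C# against the buffer API, and `largest_city_per_province` is a name invented for this example.

```python
def largest_city_per_province(rows):
    """Single pass over the input, remembering the best row per province.

    Output is only produced after all input has been consumed, which is the
    behaviour an asynchronous transformation allows.
    """
    best = {}
    for province, city, population in rows:
        current = best.get(province)
        # A strict ">" keeps the first row seen when populations tie.
        if current is None or population > current[1]:
            best[province] = (city, population)
    for province, (city, population) in best.items():
        yield province, city, population


# Sample rows from the question: (province, city, population).
rows = [
    ("Ontario", "Toronto", 7000000),
    ("Ontario", "London", 300000),
    ("Quebec", "Quebec", 300000),
    ("Quebec", "Montreal", 6000000),
]

result = list(largest_city_per_province(rows))
for row in result:
    print(row)
```

Because the whole row is kept alongside the running maximum, the City column comes along for free and the source is read only once.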

Finally, as a general caveat, I'd investigate what impact ties will have on your solution. For this data set, I wouldn't expect something like Guelph to suddenly swell and tie Toronto, but if it did, what should the package do? Right now, both approaches will result in 2 rows for Ontario, but is that the intended behaviour? Script, of course, allows you to define what happens in the case of ties. You could probably stand the 2008 solution on its head by caching the "normal" data and using that as your lookup condition and using the aggregates to pull back just one of the ties. 2005 can probably do the same just by putting the aggregate as the left source for the merge join.
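
To make the tie case concrete, here is a small Python sketch. The Guelph figure is invented to force a tie, and picking the alphabetically first city is just one arbitrary rule chosen for illustration; the right rule depends on what the package is supposed to deliver.

```python
# Hypothetical data: Guelph's population is made up so that it ties Toronto.
rows = [
    ("Ontario", "Toronto", 7000000),
    ("Ontario", "Guelph", 7000000),
    ("Ontario", "London", 300000),
]

# Aggregate, then join back on (province, population): both tied rows survive.
max_pop = {}
for province, _city, population in rows:
    max_pop[province] = max(population, max_pop.get(province, population))
joined = [r for r in rows if max_pop[r[0]] == r[2]]

# One possible tie-break (an assumption, not a recommendation):
# keep the alphabetically first city within each province.
winners = {}
for province, city, population in joined:
    kept = winners.get(province)
    if kept is None or city < kept[1]:
        winners[province] = (province, city, population)
one_per_province = list(winners.values())

print(joined)
print(one_per_province)
```

Whatever rule is chosen, making it explicit is what guarantees exactly one row per province.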

Edits

Jason Horner had a good idea in his comment. A different approach would be to use a multicast transformation and perform the aggregation in one stream and bring it back together. I couldn't figure out how to make it work with a union all but we could use sorts and merge join much like in the above. This is probably a better approach as it saves us the trouble of reprocessing the source data.
