Performance for RBAR vs. set-based processing with varying transactional sizes


Question


It is conventional wisdom that set based processing of tables should always be preferred over RBAR - especially when the tables grow larger and/or you need to update many rows.


But does that always hold? I have experienced quite a few situations - on different hardware - where set-based processing shows exponential growth in time consumption, while splitting the same workload into smaller chunks gives linear growth.


I think it would be interesting either to be proven totally wrong - if I'm missing something obvious - or, if not, to know when splitting the workload is worth the effort, and subsequently to identify which indicators help decide which approach to use. I'm personally expecting the following components to be interesting:

  • Size of the workload
  • Size and growth of the log file
  • Amount of RAM
  • Speed of the disk system


Any others? The number of CPUs/CPU cores?


Example 1: I have a 12 million row table and I have to update one or two fields in each row with data from another table. If I do this in one simple UPDATE, it takes ~30 minutes on my test box. But I'll be done in ~24 minutes if I split this into twelve chunks, i.e.:

WHERE <key> BETWEEN 0 AND 999999
WHERE <key> BETWEEN 1000000 AND 1999999
...
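The chunked UPDATE can be sketched as a T-SQL loop over half-open key ranges, so no boundary row falls into two chunks. This is a sketch under stated assumptions: `dbo.BigTable`, `dbo.Source`, the join key `Id`, and the updated columns are all hypothetical placeholders, not names from the question:

```sql
-- Sketch only: dbo.BigTable, dbo.Source, Id, Field1/Field2 are placeholders.
DECLARE @lo INT = 0, @hi INT = 12000000, @chunk INT = 1000000;

WHILE @lo < @hi
BEGIN
    BEGIN TRANSACTION;

    UPDATE t
    SET    t.Field1 = s.Field1,
           t.Field2 = s.Field2
    FROM   dbo.BigTable AS t
    JOIN   dbo.Source   AS s ON s.Id = t.Id
    WHERE  t.Id >= @lo AND t.Id < @lo + @chunk;  -- half-open range: no overlap

    COMMIT TRANSACTION;  -- small transactions keep the log manageable

    SET @lo += @chunk;
END;
```

Under the Simple recovery model mentioned below, each COMMIT allows the log to be truncated, which is one reason chunking can avoid the log growth a single giant transaction would cause.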


Example 2 is a 200+ million row table that also needs several calculations done to practically all rows. If I do the full set all in one go, my box will run for three days and not even be done then. If I write a simple C# program to execute the exact same SQL, but with WHERE clauses appended to limit the transaction size to 100k rows at a time, it'll be done in ~14 hours.


For the record: My results are from the same databases, resting on the same physical hardware, with updated statistics, no changes in indexes, the Simple recovery model, etc.


And no, I haven't tried 'true' RBAR, although I probably should - even though it would only be to see how long that would really take.

Answer


No, there is no rule that set-based is always faster. We have cursors for a reason (and don't be fooled into believing that a while loop or some other type of looping is really all that different from a cursor). Itzik Ben-Gan has demonstrated a few cases where cursors are much better, particularly for running totals problems. There are also cases like the ones you describe, where you're trying to update 12 million rows and, due to memory constraints, log usage or other reasons, it's just too much for SQL Server to handle as a single operation without spilling to tempdb, or settling for a sub-optimal plan produced by early termination because a better plan wasn't found quickly enough.


One of the reasons cursors get a bad rap is that people are lazy and just say:

DECLARE c CURSOR FOR SELECT ...


When they almost always should be saying:

DECLARE c CURSOR 
    LOCAL FORWARD_ONLY STATIC READ_ONLY 
    FOR SELECT ...
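For completeness, the full pattern around that declaration looks roughly like this; the query, table, and the `@val` variable are hypothetical placeholders:

```sql
-- Sketch only: dbo.SomeTable, SomeColumn, @val are placeholders.
DECLARE @val INT;

DECLARE c CURSOR
    LOCAL FORWARD_ONLY STATIC READ_ONLY
    FOR SELECT SomeColumn FROM dbo.SomeTable;

OPEN c;
FETCH NEXT FROM c INTO @val;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work goes here

    FETCH NEXT FROM c INTO @val;
END;

CLOSE c;
DEALLOCATE c;
```

Note that LOCAL scopes the cursor to the batch, and forgetting CLOSE/DEALLOCATE on a global cursor is a classic source of leaks.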


This is because those extra keywords make the cursor more efficient for various reasons. Based on the documentation you would expect some of those options to be redundant, but in my testing this is not the case. See this blog post of mine and this blog post from fellow SQL Server MVP Hugo Kornelis for more details.


That all said, in most cases your best bet is going to be set-based (or at least chunky set-based, as you described above). But for one-off admin tasks (which I hope your 12-million row update is), it is sometimes easier / more efficient to just write a cursor than to spend a lot of effort constructing an optimal query that produces an adequate plan. Queries that will be run a lot as normal operations within the scope of the application are worth more effort to optimize as set-based (keeping in mind that you may still end up with a cursor).

