条件聚合性能 [英] Conditional aggregation performance

查看:49
本文介绍了条件聚合性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

让我们有以下数据

 IF OBJECT_ID('dbo.LogTable', 'U') IS NOT NULL  DROP TABLE dbo.LogTable

 SELECT TOP 100000 DATEADD(day, ( ABS(CHECKSUM(NEWID())) % 65530 ), 0) datesent 
 INTO [LogTable]
 FROM    sys.sysobjects
 CROSS JOIN sys.all_columns

我想统计行数、去年行数和过去十年行数.这可以使用条件聚合查询或使用子查询来实现,如下所示

I want to count the number of rows, the number of last year rows and the number of last ten years rows. This can be achieved using conditional aggregation query or using subqueries as follows

-- conditional aggregation query
SELECT
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-1,GETDATE())
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-10,GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable


-- subqueries
SELECT
(
    SELECT count(*) FROM LogTable 
) all_cnt, 
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-1,GETDATE())
) last_year_cnt,
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-10,GETDATE())
) last_ten_year_cnt

如果您执行查询并查看查询计划,您会看到类似

If you perform the queries and look on query plans then you see something like

显然,第一个解决方案具有更好的查询计划、成本估算,甚至 SQL 命令看起来更简洁和花哨.但是,如果您使用 SET STATISTICS TIME ON 测量查询的 CPU 时间,我会得到以下结果(我已经测量了几次,结果大致相同)

Clearly, the first solution has much nicer query plan, cost estimation and even the SQL command looks more concise and fancy. However, if you measure the CPU time of the query using SET STATISTICS TIME ON I get the following results (I have measured several times with approximately the same results)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 41 ms.

(1 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 26 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

因此,与使用条件聚合的解决方案相比,第二种解决方案的性能略好(或相同).如果我们在 datesent 属性上创建索引,差异会变得更加明显.

Therefore, the second solution has slightly better (or the same) performance than the solution using conditional aggregation. The difference becomes more evident if we create the index on datesent attribute.

CREATE INDEX ix_logtable_datesent ON dbo.LogTable(DateSent)

然后第二个解决方案开始使用Index Seek而不是Table Scan,它的查询CPU时间性能在我的电脑上下降到16ms.

Then the second solution starts to use Index Seek instead of Table Scan and its query CPU time performance drops to 16ms on my computer.

我的问题有两个:(1)为什么条件聚合解决方案至少在没有索引的情况下不优于子查询解决方案,(2)是否可以为条件聚合解决方案创建索引"(或重写条件聚合查询)以避免扫描,或者如果我们关心性能,条件聚合通常不适合?

My questions are two: (1) why the conditional aggregation solution does not outperform the subquery solution at least in the case without index, (2) is it possible to create 'index' for the conditional aggregation solution (or rewrite the conditional aggregation query) in order to avoid scan, or is conditional aggregation generally unsuitable if we are concerned about performance?

旁注:我可以说,这种情况对于条件聚合非常乐观,因为我们选择了所有行的数量,这总是导致使用扫描的解决方案.如果不需要所有行的数量,则带有子查询的索引解决方案不进行扫描,而带有条件聚合的解决方案无论如何都必须执行扫描.

Sidenote: I can say, that this scenario is quite optimistic for conditional aggregation since we select the number of all rows which always leads to a solution using scan. If the number of all rows is not needed, then indexed solution with subqueries has no scan, whereas, the solution with conditional aggregation has to perform the scan anyway.

编辑

Vladimir Baranov 基本上回答了第一个问题(非常感谢).然而,第二个问题仍然存在.我可以在 StackOverflow 上看到经常使用条件聚合解决方案的答案,它们吸引了很多注意力,被认为是最优雅和清晰的解决方案(有时被提议为最有效的解决方案).因此,我将这个问题稍微概括一下:

Vladimir Baranov basically answered the first question (thank you very much). However, the second question remains. I can see on StackOverflow answers using conditional aggregation solutions quite offten and they attract a lot of attention being accepted as the most elegant and clear solution (and sometimes being proposed as the most efficient solution). Therefore, I will slightly generalize the question:

你能给我举个例子,其中条件聚合明显优于子查询解决方案吗?

为简单起见,我们假设不存在物理访问(数据在缓冲区缓存中),因为今天的数据库服务器无论如何都将大部分数据保留在内存中.

For simplicity let us assume that physical accesses are not present (data are in Buffer cache) since the today database servers remain most of their data in the memory anyway.

推荐答案

简短总结

  • 子查询方法的性能取决于数据分布.
  • 条件聚合的性能不依赖于数据分布.
  • 子查询方法可以比条件聚合更快或更慢,这取决于数据分布.

    Subqueries method can be faster or slower than conditional aggregation, it depends on the data distribution.

    当然,如果表有合适的索引,那么子查询可能会从中受益,因为索引将允许只扫描表的相关部分而不是全扫描.拥有合适的索引不太可能对条件聚合方法有显着的好处,因为它无论如何都会扫描完整的索引.唯一的好处是如果索引比表更窄,引擎将不得不将更少的页面读入内存.

    Naturally, if the table has a suitable index, then subqueries are likely to benefit from it, because index would allow to scan only the relevant part of the table instead of the full scan. Having a suitable index is unlikely to significantly benefit the Conditional aggregation method, because it will scan the full index anyway. The only benefit would be if the index is narrower than the table and engine would have to read fewer pages into memory.

    了解了这一点,您就可以决定选择哪种方法了.

    Knowing this you can decide which method to choose.

    我做了一个更大的测试表,有 500 万行.表上没有索引.我使用 SQL Sentry Plan Explorer 测量了 IO 和 CPU 统计信息.我使用 SQL Server 2014 SP1-CU7 (12.0.4459.0) Express 64 位进行这些测试.

    I made a larger test table, with 5M rows. There were no indexes on the table. I measured the IO and CPU stats using SQL Sentry Plan Explorer. I used SQL Server 2014 SP1-CU7 (12.0.4459.0) Express 64-bit for these tests.

    确实,您的原始查询的行为与您描述的一样,即子查询速度更快,即使读取次数高 3 倍.

    Indeed, your original queries behaved as you described, i.e. subqueries were faster even though the reads were 3 times higher.

    在没有索引的表上尝试几次后,我重写了您的条件聚合并添加了变量来保存 DATEADD 表达式的值.

    After few tries on a table without an index I rewrote your conditional aggregate and added variables to hold the value of DATEADD expressions.

    总体时间明显加快.

    然后我用 COUNT 替换了 SUM,它又变得更快了.

    Then I replaced SUM with COUNT and it became a little bit faster again.

    毕竟,条件聚合变得和子查询一样快.

    After all, conditional aggregation became pretty much as fast as subqueries.

    预热缓存 (CPU=375)

    SELECT -- warm cache
        COUNT(*) AS all_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    子查询 (CPU=1031)

    Subqueries (CPU=1031)

    SELECT -- subqueries
    (
        SELECT count(*) FROM LogTable 
    ) all_cnt, 
    (
        SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-1,GETDATE())
    ) last_year_cnt,
    (
        SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-10,GETDATE())
    ) last_ten_year_cnt
    OPTION (RECOMPILE);
    

    原始条件聚合 (CPU=1641)

    Original conditional aggregation (CPU=1641)

    SELECT -- conditional original
        COUNT(*) AS all_cnt,
        SUM(CASE WHEN datesent > DATEADD(year,-1,GETDATE())
                 THEN 1 ELSE 0 END) AS last_year_cnt,
        SUM(CASE WHEN datesent > DATEADD(year,-10,GETDATE())
                 THEN 1 ELSE 0 END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    带变量的条件聚合 (CPU=1078)

    Conditional aggregation with variables (CPU=1078)

    DECLARE @VarYear1 datetime = DATEADD(year,-1,GETDATE());
    DECLARE @VarYear10 datetime = DATEADD(year,-10,GETDATE());
    
    SELECT -- conditional variables
        COUNT(*) AS all_cnt,
        SUM(CASE WHEN datesent > @VarYear1
                 THEN 1 ELSE 0 END) AS last_year_cnt,
        SUM(CASE WHEN datesent > @VarYear10
                 THEN 1 ELSE 0 END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    使用变量和 COUNT 而不是 SUM 的条件聚合 (CPU=1062)

    Conditional aggregation with variables and COUNT instead of SUM (CPU=1062)

    SELECT -- conditional variable, count, not sum
        COUNT(*) AS all_cnt,
        COUNT(CASE WHEN datesent > @VarYear1
                 THEN 1 ELSE NULL END) AS last_year_cnt,
        COUNT(CASE WHEN datesent > @VarYear10
                 THEN 1 ELSE NULL END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    基于这些结果,我的猜测是 CASE 为每一行调用了 DATEADD,而 WHERE 足够聪明,可以计算一次.加上 COUNTSUM 效率高一点点.

    Based on these results my guess is that CASE invoked DATEADD for each row, while WHERE was smart enough to calculate it once. Plus COUNT is a tiny bit more efficient than SUM.

    最后,条件聚合只比子查询稍慢(1062 vs 1031),可能是因为 WHERE 本身比 CASE 更有效,此外, WHERE 过滤掉了相当多的行,所以 COUNT 必须处理更少的行.

    In the end, conditional aggregation is only slightly slower than subqueries (1062 vs 1031), maybe because WHERE is a bit more efficient than CASE in itself, and besides, WHERE filters out quite a few rows, so COUNT has to process less rows.

    在实践中我会使用条件聚合,因为我认为读取次数更重要.如果您的表很小,无法容纳并留在缓冲池中,那么对于最终用户来说,任何查询都会很快.但是,如果表大于可用内存,那么我预计从磁盘读取会显着减慢子查询.

    In practice I would use conditional aggregation, because I think that number of reads is more important. If your table is small to fit and stay in the buffer pool, then any query will be fast for the end user. But, if the table is larger than available memory, then I expect that reading from disk would slow subqueries significantly.

    另一方面,尽早过滤出行也很重要.

    On the other hand, filtering the rows out as early as possible is also important.

    这是测试的一个细微变化,它演示了它.在这里,我将阈值设置为 GETDATE() + 100 年,以确保没有任何行满足过滤条件.

    Here is a slight variation of the test, which demonstrates it. Here I set the threshold to be GETDATE() + 100 years, to make sure that no rows satisfy the filter criteria.

    预热缓存 (CPU=344)

    SELECT -- warm cache
        COUNT(*) AS all_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    子查询 (CPU=500)

    SELECT -- subqueries
    (
        SELECT count(*) FROM LogTable 
    ) all_cnt, 
    (
        SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,100,GETDATE())
    ) last_year_cnt
    OPTION (RECOMPILE);
    

    原始条件聚合 (CPU=937)

    SELECT -- conditional original
        COUNT(*) AS all_cnt,
        SUM(CASE WHEN datesent > DATEADD(year,100,GETDATE())
                 THEN 1 ELSE 0 END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    带变量的条件聚合 (CPU=750)

    DECLARE @VarYear100 datetime = DATEADD(year,100,GETDATE());
    
    SELECT -- conditional variables
        COUNT(*) AS all_cnt,
        SUM(CASE WHEN datesent > @VarYear100
                 THEN 1 ELSE 0 END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    使用变量和 COUNT 而不是 SUM 的条件聚合 (CPU=750)

    SELECT -- conditional variable, count, not sum
        COUNT(*) AS all_cnt,
        COUNT(CASE WHEN datesent > @VarYear100
                 THEN 1 ELSE NULL END) AS last_ten_year_cnt
    FROM LogTable
    OPTION (RECOMPILE);
    

    下面是一个带有子查询的计划.您可以看到在第二个子查询中,有 0 行进入了 Stream Aggregate,所有这些行都在 Table Scan 步骤中被过滤掉了.

    Below is a plan with subqueries. You can see that 0 rows went into the Stream Aggregate in the second subquery, all of them were filtered out at the Table Scan step.

    因此,子查询再次变得更快.

    As a result, subqueries are again faster.

    这里我改变了之前测试的过滤标准:所有的>都换成了<.结果,条件 COUNT 计算所有行而不是无.惊喜,惊喜!条件聚合查询花费了同样的 750 毫秒,而子查询变为 813 而不是 500.

    Here I changed the filtering criteria of the previous test: all > were replaced with <. As a result, the conditional COUNT counted all rows instead of none. Surprise, surprise! Conditional aggregation query took same 750 ms, while subqueries became 813 instead of 500.

    这是子查询的计划:

    你能给我举个例子,其中条件聚合尤其明显优于子查询解决方案?

    Could you give me an example, where conditional aggregation notably outperforms the subquery solution?

    给了.子查询方法的性能取决于数据分布.条件聚合的性能不依赖于数据分布.

    Here it is. Performance of subqueries method depends on the data distribution. Performance of conditional aggregation does not depend on the data distribution.

    子查询方法可以比条件聚合更快或更慢,这取决于数据分布.

    Subqueries method can be faster or slower than conditional aggregation, it depends on the data distribution.

    了解了这一点,您就可以决定选择哪种方法了.

    Knowing this you can decide which method to choose.

    如果您将鼠标悬停在 Table Scan 运算符上,您可以看到不同变体中的 Actual Data Size.

    If you hover the mouse over the Table Scan operator you can see the Actual Data Size in different variants.

    1. 简单的COUNT(*):

    1. 条件聚合:

    1. 测试 2 中的子查询:

    1. 测试 3 中的子查询:

    现在很明显,性能差异很可能是由于流经计划的数据量不同造成的.

    Now it becomes clear that the difference in performance is likely caused by the difference in the amount of data that flows through the plan.

    在简单的COUNT(*)的情况下,没有输出列表(不需要列值),数据大小最小(43MB).

    In case of simple COUNT(*) there is no Output list (no column values are needed) and data size is smallest (43MB).

    在条件聚合的情况下,此数量在测试 2 和 3 之间不会改变,始终为 72MB.输出列表有一列datesent.

    In case of conditional aggregation this amount doesn't change between tests 2 and 3, it is always 72MB. Output list has one column datesent.

    在子查询的情况下,此数量确实会根据数据分布而变化.

    In case of subqueries, this amount does change depending on the data distribution.

    这篇关于条件聚合性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆