递归相关表达式如何加速不同的查询? [英] How does a recursive correlated expression speed up distinct queries?

查看:41
本文介绍了递归相关表达式如何加速不同的查询?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我发现这篇博文关于加速不同的查询:

I found this post about speeding up distinct queries:

使用递归 CTE 的超快速 DISTINCT:

Super-fast DISTINCT using a recursive CTE:

USE     tempdb;
GO
DROP    TABLE dbo.Test;
GO
CREATE  TABLE 
        dbo.Test 
        (
        data            INTEGER NOT NULL,
        );
GO
CREATE  CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT  dbo.Test WITH (TABLOCK)
        (data)
SELECT  TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM    master.sys.columns C1,
        master.sys.columns C2,
        master.sys.columns C3;
GO



SET     STATISTICS TIME ON;

-- 1591ms CPU
SELECT  DISTINCT 
        data
FROM    dbo.Test;

-- 15 毫秒 CPU

-- 15ms CPU

WITH    RecursiveCTE
AS      (
        SELECT  data = MIN(T.data)
        FROM    dbo.Test T
        UNION   ALL
        SELECT  R.data
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE Smile
                SELECT  T.data,
                        rn = ROW_NUMBER() OVER (ORDER BY T.data)
                FROM    dbo.Test T
                JOIN    RecursiveCTE R
                        ON  R.data < T.data
                ) R
        WHERE   R.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0);

SET     STATISTICS TIME OFF;
GO
DROP    TABLE dbo.Test;

递归 CTE 的效率提高了 100 倍 :-) 这种加速对我当前的项目非常有价值,但我不确定这种方法在哪些情况下是有益的.

The recursive CTE is 100 times more efficient :-) This kind of speedup would be extremely valuable for my current project, but I am not sure in which cases this approach is beneficial.

老实说:我不明白为什么这会加快查询速度,以及为什么数据库不能自己进行这种优化.你能解释一下这是如何工作的以及为什么它如此有效吗?

To be honest: I don't get why this speeds up the query that much and why the database cannot do this optimization itself. Can you explain how this works and why it is so effective?

我在 sybase 上看到了类似的效果,所以这种方法似乎只对 sql-server 有效.

I see a similar effect on sybase, so this approach does not seem to be valid for sql-server only.

子问题:递归 CTE 对其他数据库系统也有用吗?

Sub-question: is the recursive CTE useful for other data base systems as well?

推荐答案

Paul White 解释了这个技巧";在他的帖子性能调优整个查询计划寻找不同的值部分.

Paul White explained this "trick" in great detail in his post Performance Tuning the Whole Query Plan under Finding the Distinct Values section.

为什么数据库不能自己做这个优化?

Why the database cannot do this optimization itself?

递归 CTE 对其他数据库系统也有用吗?

Is the recursive CTE useful for other data base systems as well?

优化器并不完美,它没有实现所有可能的技术.人们要求微软实施它.请参阅此连接项实施索引跳过扫描.它被关闭为 Won't Fix,但这并不意味着它不会在未来得到解决.其他 DBMS 可能已经实现了它(连接项说 Oracle 实现了这个优化).如果这种优化是在 DBMS 引擎中实现的,那么这个技巧"就变得很简单了.不需要,优化器会根据可用的统计数据选择最佳的计算方法.

Optimizer is not perfect and it doesn't implement all possible techniques. People asked Microsoft to implement it. See this Connect item Implement Index Skip Scan. It was closed as Won't Fix, but that doesn't mean that it won't be addressed in the future. Other DBMSes may have implemented it (Connect item says that Oracle implements this optimization). If this kind of optimization is implemented in the DBMS engine, then this "trick" is not needed and optimizer would choose optimal method of calculating the result based on available statistics.

我不明白为什么这会大大加快查询速度.

I don't get why this speeds up the query that much.

我不确定这种方法在哪些情况下是有益的

I am not sure in which cases this approach is beneficial

简单的 DISTINCT 查询扫描整个索引.扫描"意味着它从磁盘读取索引的每一页,并聚合内存(或 tempdb)中的值以获得不同值的列表.

The simple DISTINCT query scans the whole index. "Scans" means that it reads each and every page of the index from the disk and aggregates the values in memory (or tempdb) to get a list of distinct values.

如果您知道该表有很多行,但只有几个不同的不同值,那么读取所有这些重复值是浪费时间.递归 CTE 强制服务器为第一个不同的值寻找索引,然后为第二个值寻找索引,依此类推.寻找"意味着服务器在索引中使用二进制搜索来查找值.通常一次查找只需要从磁盘读取几页.索引"是一棵平衡树.

If you know that the table has a lot of rows, but only few different distinct values, then reading all these repeating values is a waste of time. Recursive CTE forces the server to seek an index for the first distinct value, then seek an index for the second value and so on. "Seek" means that server uses binary search in the index to find the value. Usually one seek requires reading of only few pages from disk. "Index" is a balanced tree.

如果表只有几个不同的值,则查找几次比读取索引的所有页要快.另一方面,如果有很多不同的值,那么顺序读取所有页面会比查找每个连续值更快.这应该能让您了解这种方法在哪些情况下是有益的.

If the table has only few distinct values it is faster to seek few times, than read all pages of the index. On the other hand, if there are a lot of distinct values then it would be faster to read all pages sequentially, than seeking for each consecutive value. This should give you an idea in what cases this approach is beneficial.

显然,如果表很小,扫描它会更快.只有当表变得足够大"时您开始看到性能上的差异.

Obviously, if the table is small, it is faster to scan it. Only when the table becomes "large enough" you start seeing the difference in performance.

dba.se 上有一个相关的问题:是否可以为不同/分组获得基于查找的并行计划?

There is a related question on dba.se: Is it possible to get seek based parallel plan for distinct/group by?

这篇关于递归相关表达式如何加速不同的查询?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆