How does predicate pushdown work exactly?

Question

Could anyone please explain, with examples, how exactly predicate pushdown works?

Recommended answer

Say you want to execute the following query:

SELECT 
  SUM(price) 
FROM sales 
WHERE 
  purchase_date BETWEEN '2018-01-01' and '2018-01-31';

A trivial query engine would iterate over all the Parquet/ORC files, deserialize the price and purchase_date columns, apply the predicate to purchase_date, and sum the price of the rows that pass the filter.

Parquet maintains min/max statistics on the columns in each file (ORC keeps comparable column statistics as well), so if the execution engine is smart enough, it can look at the min/max of purchase_date in the statistics and determine whether any row can possibly match. For example, if purchase_date.min=2014-05-05 and purchase_date.max=2014-05-06, it can deduce that the predicate will always evaluate to false for that file.
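The min/max check described above boils down to a range-overlap test. Here is a minimal sketch in Python; the function name may_contain_match is made up for illustration and is not part of any Parquet library.

```python
from datetime import date

def may_contain_match(col_min, col_max, lo, hi):
    """Return False only when [col_min, col_max] cannot overlap [lo, hi]."""
    return col_max >= lo and col_min <= hi

# Statistics from the example above: every purchase_date is in May 2014.
stats_min, stats_max = date(2014, 5, 5), date(2014, 5, 6)
# Predicate bounds from the query: BETWEEN '2018-01-01' AND '2018-01-31'.
lo, hi = date(2018, 1, 1), date(2018, 1, 31)

# The ranges do not overlap, so no row can satisfy the predicate.
can_skip = not may_contain_match(stats_min, stats_max, lo, hi)
print(can_skip)
```

Note that the test is conservative: an overlap only means the file might contain matching rows, so the rows still have to be filtered after reading.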

In other words, it can skip Parquet files by combining the statistics with the filter predicate. This can lead to a massive performance gain, because I/O (file or memory) is usually the bottleneck. The gain grows as the selectivity (the percentage of matching rows) shrinks: the fewer rows match, the more data can be skipped.
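To make the file skipping concrete, here is a toy simulation in plain Python. The FILES list is a hypothetical in-memory stand-in for Parquet files and their statistics, and sum_price is an illustrative helper, not how any particular engine is implemented.

```python
from datetime import date

# Hypothetical stand-ins for Parquet files: each carries min/max statistics
# for purchase_date plus its rows as (purchase_date, price) tuples.
FILES = [
    {"min": date(2014, 5, 5), "max": date(2014, 5, 6),
     "rows": [(date(2014, 5, 5), 10.0), (date(2014, 5, 6), 20.0)]},
    {"min": date(2017, 12, 30), "max": date(2018, 1, 2),
     "rows": [(date(2017, 12, 30), 5.0), (date(2018, 1, 2), 7.5)]},
    {"min": date(2018, 1, 10), "max": date(2018, 1, 20),
     "rows": [(date(2018, 1, 10), 1.5), (date(2018, 1, 20), 2.5)]},
    {"min": date(2018, 3, 1), "max": date(2018, 3, 9),
     "rows": [(date(2018, 3, 1), 100.0)]},
]

def sum_price(files, lo, hi, use_stats):
    """Return (sum of matching prices, number of files actually read)."""
    total, files_read = 0.0, 0
    for f in files:
        if use_stats and (f["max"] < lo or f["min"] > hi):
            continue  # statistics prove no row can match: skip the file
        files_read += 1  # reading the file is the expensive step
        for purchase_date, price in f["rows"]:
            if lo <= purchase_date <= hi:  # rows are still filtered one by one
                total += price
    return total, files_read

lo, hi = date(2018, 1, 1), date(2018, 1, 31)
naive = sum_price(FILES, lo, hi, use_stats=False)
pruned = sum_price(FILES, lo, hi, use_stats=True)
print(naive, pruned)
```

Both variants return the same sum, but the second one reads only the two files whose date range overlaps January 2018.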

The term predicate pushdown comes from the fact that you are "hinting" the scan operator with the predicate, which it then uses to filter the rows of interest. Put differently, you are pushing the predicate down to the scan.
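As a rough sketch of what "pushing the predicate to the scan" means at the plan level, compare a filter operator placed above the scan with a scan that receives the predicate directly. The operators scan and filter_op and the sales rows below are invented for illustration; a real engine would also use the pushed predicate against file statistics.

```python
def scan(table, predicate=None):
    """Scan operator: yields rows, applying a pushed-down predicate if given."""
    for row in table:
        if predicate is None or predicate(row):
            yield row

def filter_op(rows, predicate):
    """Filter operator that sits above the scan in the plan."""
    return (row for row in rows if predicate(row))

sales = [
    {"purchase_date": "2018-01-15", "price": 10.0},
    {"purchase_date": "2018-02-01", "price": 99.0},
    {"purchase_date": "2018-01-31", "price": 2.5},
]

def in_january(row):
    # ISO-8601 date strings compare correctly as plain strings.
    return "2018-01-01" <= row["purchase_date"] <= "2018-01-31"

# Plan A: Aggregate <- Filter <- Scan (the scan emits every row).
rows_from_scan_a = list(scan(sales))
total_a = sum(r["price"] for r in filter_op(rows_from_scan_a, in_january))

# Plan B: Aggregate <- Scan(predicate) (the predicate is pushed into the scan).
rows_from_scan_b = list(scan(sales, predicate=in_january))
total_b = sum(r["price"] for r in rows_from_scan_b)

print(len(rows_from_scan_a), len(rows_from_scan_b), total_a, total_b)
```

The result is identical, but in plan B fewer rows leave the scan, which is where the savings come from.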
