在vertica中优化连接 [英] Optimizing join in vertica

查看:200
本文介绍了在vertica中优化连接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这样的查询

  SELECT a.column,b.column 
FROM
table_a INNER JOIN tableb_b ON
a.id = b.id
其中a.anotherid ='某些条件'

这应该非常快,因为使用谓词a.anotherid ='some condition'时,查询计划应在table_b上过滤大量数据。
但是,根据



那么,有没有办法在加入连接之前推送过滤器?
还是有其他方法可以优化查询?

解决方案


  1. EXPLAIN 显示无统计信息。这些必须更新

  2. 在这种情况下,Vertica将使用SIP优化谓词:




Sideways Information Passing(SIP)通过在计划中尽早过滤数据,有效提高了连接性能。可以将其视为谓词下推的高级变体,因为该联接被用于进行过滤[27]。例如,考虑一个使用简单的相等谓词连接两个表的HashJoin。 HashJoin将首先从内部输入创建哈希表,然后再开始从外部输入读取数据以进行联接。特殊的SIP筛选器是在优化器计划期间构建的,并放置在扫描操作器中。在运行时,扫描程序可以访问Join的哈希表,并且SIP筛选器用于评估哈希表中是否存在外键值。扫描不会输出未通过这些过滤器的行,从而提高了性能,因为我们不必不必要地使数据通过计划,而仅在以后由联接过滤掉。


例如:

  SELECT a.online_page_key 
FROM online_sales.online_sales_fact a
JOIN online_sales.online_page_dimension b
ON b.online_page_key = a.online_page_key
WHERE b.page_type ='季度';

将产生与以下计划相同的计划:

  SELECT a.online_page_key 
FROM online_sales.online_sales_fact a
JOIN(选择*
FROM online_sales.online_page_dimension
WHERE page_type ='quarterly') b
ON b.online_page_key = a.online_page_key;

看起来像这样:

 访问路径:
+ -JOIN哈希[费用:14K,行:988K](PATH ID:1)
|加入Cond:(online_page_dimension.online_page_key = a.online_page_key)
| +-外部-> [成本:12K,行:5M]的存储访问(路径ID:2) |投影:online_sales.online_sales_fact_super
| |实现:a.online_page_key
| |运行时过滤器:(SIP1(HashJoin):a.online_page_key)
| +-内在-> online_page_dimension的存储访问权限[费用:36,行数:198](路径ID:3)
| |投影:online_sales.online_page_dimension_super
| |实现:online_page_dimension.online_page_key
| |过滤器:(online_page_dimension.page_type ='季度')




  1. 大多数情况下,哈希联接就足够了。如果您想改善合并联接,请参见优化合并联接


I hava a query like this

SELECT a.column, b.column
FROM
table_a a INNER JOIN tableb_b ON
a.id= b.id
where a.anotherid = 'some condition'

It is supposed to be very fast because with the predicate a.anotherid = 'some condition' the query plan should filter much data on table_b. However, according to the document of Vertica,

The WHERE clause is evaluated after the join is performed. It filters records returned by the FROM clause, eliminating any records that do not satisfy the WHERE clause condition.

Which is mean the query will do the join first and then filtering, which is very slow, this is also showed in the query plan

So, is there any way to push the filter before the join? Or is there any other way to optimize the query?

解决方案

  1. The EXPLAIN shows NO STATISTICS. These need to be updated.
  2. Vertica will optimize the predicate in this case using SIP:

Sideways Information Passing (SIP) has been effective in improving join performance by filtering data as early as possible in the plan. It can be thought of as an advanced variation of predicate push down since the join is being used to do filtering [27]. For example, consider a HashJoin that joins two tables using simple equality predicates. The HashJoin will first create a hash table from the inner input before it starts reading data from the outer input to do the join. Special SIP filters are built during optimizer planning and placed in the Scan operator. At run time, the Scan has access to the Join’s hash table and the SIP filters are used to evaluate whether the outer key values exist in the hash table. Rows that do not pass these filters are not output by the Scan thus increaseing performance since we are not unnecessarily bringing the data through the plan only to be filtered away later by the join.

For example:

SELECT a.online_page_key
FROM   online_sales.online_sales_fact a
       JOIN online_sales.online_page_dimension b
         ON b.online_page_key = a.online_page_key
WHERE  b.page_type = 'quarterly';

Will produce the same plan as:

SELECT a.online_page_key 
FROM   online_sales.online_sales_fact a 
       JOIN (SELECT * 
             FROM   online_sales.online_page_dimension 
             WHERE  page_type = 'quarterly') b 
         ON b.online_page_key = a.online_page_key;

Which looks like:

 Access Path:
 +-JOIN HASH [Cost: 14K, Rows: 988K] (PATH ID: 1)
 |  Join Cond: (online_page_dimension.online_page_key = a.online_page_key)
 | +-- Outer -> STORAGE ACCESS for a [Cost: 12K, Rows: 5M] (PATH ID: 2)
 | |      Projection: online_sales.online_sales_fact_super
 | |      Materialize: a.online_page_key
 | |      Runtime Filter: (SIP1(HashJoin): a.online_page_key)
 | +-- Inner -> STORAGE ACCESS for online_page_dimension [Cost: 36, Rows: 198] (PATH ID: 3)
 | |      Projection: online_sales.online_page_dimension_super
 | |      Materialize: online_page_dimension.online_page_key
 | |      Filter: (online_page_dimension.page_type = 'quarterly')

  1. Most times, a hash join is sufficient. If you want to improve for a merge join, see my post on optimizing for merge join.

这篇关于在vertica中优化连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆