Are table statistics of any use prior to Spark 2.2?


Question

Spark 2.2 introduced cost-based optimization (CBO, https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html), which makes use of table statistics (as computed by ANALYZE TABLE COMPUTE STATISTICS ...).

My question is: are precomputed statistics also useful prior to Spark 2.2 (in my case 2.1) when operating on (external Hive) tables? Do statistics influence the optimizer? If yes, can I also compute the statistics in Impala instead of Hive?

Update:

The only hint I have found so far is https://issues.apache.org/jira/browse/SPARK-15365

Apparently, statistics are used to decide whether a broadcast join is done or not.

Recommended answer

"Apparently, statistics are used to decide whether a broadcast join is done or not."

As you mentioned in your update, when cost-based optimization is not turned on, the table statistics (computed using ANALYZE TABLE COMPUTE STATISTICS) are used only by the JoinSelection execution planning strategy, which decides whether to choose the BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators.
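One way to check this on your own tables is to compute the statistics and then inspect the physical plan of a join. The following is only a minimal PySpark sketch: it assumes an existing SparkSession with Hive support passed in as `spark`, and the helper names (`analyze_statement`, `analyze_and_explain`) and any table or column names are placeholders, not part of Spark.

```python
def analyze_statement(table):
    """Build the SQL statement that computes table-level statistics."""
    return "ANALYZE TABLE {} COMPUTE STATISTICS".format(table)


def analyze_and_explain(spark, left_table, right_table, join_key):
    """Compute statistics for both tables, then print the join's physical plan.

    `spark` is an existing SparkSession (assumed to have Hive support enabled);
    the table names and the join key are supplied by the caller.
    """
    # Store table-level statistics for both sides of the join.
    for table in (left_table, right_table):
        spark.sql(analyze_statement(table))

    joined = spark.table(left_table).join(spark.table(right_table), join_key)
    # Look for a broadcast join versus a sort-merge join in the printed plan.
    joined.explain()
    return joined
```

Calling `analyze_and_explain(spark, "db.orders", "db.customers", "customer_id")` (hypothetical tables) before and after the statistics exist should make it visible whether the planner's join choice changes.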

JoinSelection uses the spark.sql.autoBroadcastJoinThreshold configuration property, which is 10 MB by default.
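The size check can be pictured with a small model. This is a simplified sketch of the comparison, not Spark's actual source: the function names are made up for illustration, and it assumes a join side qualifies for broadcasting when its estimated size in bytes is non-negative and at most the threshold. Under that model, setting the threshold to -1 disables automatic broadcasting, which matches the documented behavior of the property.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB expressed in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def fits_broadcast_threshold(size_in_bytes, threshold=DEFAULT_THRESHOLD_BYTES):
    """Simplified model of the size check behind automatic broadcast joins.

    Returns True when the estimated size is non-negative and does not
    exceed the threshold. A threshold of -1 therefore never matches.
    """
    return 0 <= size_in_bytes <= threshold


def set_broadcast_threshold(spark, threshold_bytes):
    """Set the threshold on an existing SparkSession (-1 disables it)."""
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(threshold_bytes))
```

Because the decision depends on the estimated size, a table whose size estimate is missing or far too large will not be broadcast automatically, however small it really is.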
