Are table statistics of any use prior to Spark 2.2?
Question
Spark 2.2 introduced cost-based optimization (CBO, https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html), which makes use of table statistics (as computed by ANALYZE TABLE COMPUTE STATISTICS ...).
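For reference, statistics can be gathered with Spark SQL (or Hive) like this; this is a sketch, and the table name `db.my_table` and the column names are hypothetical:

```sql
-- Compute table-level statistics (row count, size in bytes)
ANALYZE TABLE db.my_table COMPUTE STATISTICS;

-- Per-column statistics (min/max, distinct count), which the CBO in Spark 2.2+ uses
ANALYZE TABLE db.my_table COMPUTE STATISTICS FOR COLUMNS id, name;
```

The statements need a live Spark (or Hive) session, so they are shown here only as a syntax reference.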
My question is: are precomputed statistics also useful prior to Spark 2.2 (2.1 in my case) when operating on (external Hive) tables? Do statistics influence the optimizer? If so, can I compute the statistics in Impala instead of Hive?
Update:
The only hint I have found so far is https://issues.apache.org/jira/browse/SPARK-15365
Apparently statistics are used to decide whether a broadcast join is done or not.
Answer
"Apparently statistics are used to decide whether a broadcast join is done or not."
As you mentioned in your UPDATE, with no cost-based optimization turned on, table statistics (computed using ANALYZE TABLE COMPUTE STATISTICS) are only used by the JoinSelection execution planning strategy, which chooses the BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators.
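One way to see whether the statistics actually triggered a broadcast join is to inspect the physical plan. A minimal sketch, with hypothetical table names `big_table` and `small_table`:

```sql
-- If small_table's statistics put its estimated size below the
-- broadcast threshold, the plan shows BroadcastHashJoin;
-- otherwise you typically see SortMergeJoin.
EXPLAIN
SELECT *
FROM big_table b
JOIN small_table s
  ON b.id = s.id;
```

This requires a running Spark session against real tables, so it is shown as a diagnostic pattern rather than a runnable example.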
JoinSelection uses the spark.sql.autoBroadcastJoinThreshold configuration property, which is 10 MB by default.
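The threshold can be adjusted per session; a sketch of the relevant settings (values other than the defaults are illustrative):

```sql
-- Threshold is in bytes; the default is 10485760 (10 MB)
SET spark.sql.autoBroadcastJoinThreshold = 104857600;

-- Setting it to -1 disables automatic broadcast joins entirely
SET spark.sql.autoBroadcastJoinThreshold = -1;
```

Tables whose statistics report an estimated size below this value become candidates for broadcasting to all executors, which is why accurate (pre)computed statistics matter even without the CBO.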