Are table statistics of any use prior to Spark 2.2?
Question
Spark 2.2 introduced cost-based optimization (CBO, https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html), which makes use of table statistics (as computed by ANALYZE TABLE COMPUTE STATISTICS ...).
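For reference, statistics can be gathered with Spark SQL (or Hive) like this; this is a sketch, and the table name `db.my_table` and the column names are hypothetical:

```sql
-- Compute table-level statistics (row count, size in bytes)
ANALYZE TABLE db.my_table COMPUTE STATISTICS;

-- Per-column statistics (min/max, distinct count), which the CBO in Spark 2.2+ uses
ANALYZE TABLE db.my_table COMPUTE STATISTICS FOR COLUMNS id, name;
```

The statements need a live Spark (or Hive) session, so they are shown here only as a syntax reference.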
My question is: are precomputed statistics also useful prior to Spark 2.2 (2.1 in my case) when operating on (external Hive) tables? Do statistics influence the optimizer? If so, can I compute the statistics in Impala instead of Hive?
Update:
The only hint I have found so far is https://issues.apache.org/jira/browse/SPARK-15365
Apparently statistics are used to decide whether a broadcast join is done or not.
Answer
"Apparently statistics are used to decide whether a broadcast join is done or not."
As you mentioned in your UPDATE, with no cost-based optimization turned on, table statistics (computed using ANALYZE TABLE COMPUTE STATISTICS) are only used by the JoinSelection execution planning strategy, which chooses the BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators.
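One way to see whether the statistics actually triggered a broadcast join is to inspect the physical plan. A minimal sketch, with hypothetical table names `big_table` and `small_table`:

```sql
-- If small_table's statistics put its estimated size below the
-- broadcast threshold, the plan shows BroadcastHashJoin;
-- otherwise you typically see SortMergeJoin.
EXPLAIN
SELECT *
FROM big_table b
JOIN small_table s
  ON b.id = s.id;
```

This requires a running Spark session against real tables, so it is shown as a diagnostic pattern rather than a runnable example.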
JoinSelection uses the spark.sql.autoBroadcastJoinThreshold configuration property, which is 10 MB by default.
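The threshold can be adjusted per session; a sketch of the relevant settings (values other than the defaults are illustrative):

```sql
-- Threshold is in bytes; the default is 10485760 (10 MB)
SET spark.sql.autoBroadcastJoinThreshold = 104857600;

-- Setting it to -1 disables automatic broadcast joins entirely
SET spark.sql.autoBroadcastJoinThreshold = -1;
```

Tables whose statistics report an estimated size below this value become candidates for broadcasting to all executors, which is why accurate (pre)computed statistics matter even without the CBO.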