How does computing table stats in Hive or Impala speed up queries in Spark SQL?

Question

For increasing performance (e.g. for joins) it is recommended to compute table statistics first.

In Hive I can do:

analyze table <table name> compute statistics;

In Impala:

compute stats <table name>;

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If yes, which one do I need to run? Do they both save the stats in the Hive metastore? I'm using Spark 1.6.1 on Cloudera 5.5.4.

Note: In the docs of Spark 1.6.1 (https://spark.apache.org/docs/1.6.1/sql-programming-guide.html), for the parameter spark.sql.autoBroadcastJoinThreshold, I found a hint:

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
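
A minimal spark-shell sketch of how this threshold interacts with join planning, assuming Spark 2.x (as in the answer below) and two hypothetical tables t1 and t2:

scala> // the default threshold is 10 MB (10485760 bytes); -1 disables broadcast joins
scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

scala> // raise it to 100 MB so that larger tables still qualify for broadcasting
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

scala> // BroadcastHashJoin appears in the physical plan only when the size
scala> // statistic Spark reads from the metastore is below the threshold
scala> sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").explain

On Spark 1.6.1 the equivalent setting goes through sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", ...).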

Answer

This describes the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?

It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the table metadata in the Hive metastore, which Spark can read (and translate to its own Spark statistics for query planning).

You can easily check it out by using the DESCRIBE EXTENDED SQL command in spark-shell.

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+

ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to the noscan option). If Impala and Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
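
As a sketch (reusing the hypothetical table t1 from above), you can run the noscan variant from Spark itself; in Spark 2.x the table-level result, if any, surfaces in the Statistics row of DESC EXTENDED:

scala> // compute only the total table size in bytes; NOSCAN skips the row count
scala> sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")

scala> // table-level statistics, if recorded, show up in the Statistics row
scala> sql("DESC EXTENDED t1").where("col_name = 'Statistics'").show(false)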

Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
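
For completeness, a hedged sketch of turning the cost-based optimizer on and computing the column-level statistics it relies on (spark.sql.cbo.enabled exists as of Spark 2.2 and is off by default):

scala> // enable cost-based optimization (disabled by default)
scala> spark.conf.set("spark.sql.cbo.enabled", true)

scala> // column statistics for the join/filter columns feed the optimizer
scala> sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")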

Column statistics are stored in table properties (in a Spark-specific serialized format), and I really doubt that Impala or Hive could compute the stats and store them in that Spark SQL-compatible format.
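
You can see that serialization directly; a sketch, assuming Spark 2.x, where column statistics computed by Spark land in Hive table properties under keys prefixed with spark.sql.statistics. (an internal naming detail that may vary between versions):

scala> // Spark-computed stats are serialized into Hive table properties
scala> sql("SHOW TBLPROPERTIES t1").where("key LIKE 'spark.sql.statistics%'").show(false)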
