How does computing table stats in Hive or Impala speed up queries in Spark SQL?
Question
For increasing performance (e.g. for joins) it is recommended to compute table statistics first.
In Hive I can do:
analyze table <table name> compute statistics;
In Impala:
compute stats <table name>;
Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If yes, which one do I need to run? Do they both save the stats in the Hive metastore? I'm using Spark 1.6.1 on Cloudera 5.5.4.
Note: In the docs of Spark 1.6.1 (https://spark.apache.org/docs/1.6.1/sql-programming-guide.html), for the parameter spark.sql.autoBroadcastJoinThreshold, I found a hint:
Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
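To illustrate the hint, here is a hedged spark-shell sketch (Spark 2.x syntax; the table names `big_fact` and `small_dim` are hypothetical) of how size stats feed into join planning: once the planner knows a table's size, it can pick a broadcast join whenever the size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default).

```scala
// Sketch only: requires a live Spark session with a Hive metastore.
// NOSCAN computes just the table size (no row count).
spark.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS NOSCAN")
spark.sql("SELECT * FROM big_fact f JOIN small_dim d ON f.id = d.id").explain()
// With fresh size stats below the threshold, the physical plan should
// show BroadcastHashJoin rather than SortMergeJoin.
```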
Answer
This answer uses the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).
Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?
It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the Hive metastore, in table metadata that Spark can read (and translate to its own Spark statistics for query planning).
You can easily check it out by using the DESCRIBE EXTENDED SQL command in spark-shell.
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL |
+--------------+----------+
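The column statistics shown above come from Spark's own ANALYZE command. As a sketch (using the same table and column names as the example), they would be produced like this:

```scala
// Sketch only: requires a spark-shell session with table t1 defined.
// Compute per-column statistics (min, max, null count, distinct count, ...)
sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")
// Inspect them for a single column, as in the output above.
sql("DESC EXTENDED t1 id").show()
```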
ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric, due to the noscan option). If Impala and Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
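What the planner actually sees can also be inspected directly. A minimal sketch, assuming a Spark 2.3+ session and the example table t1 (in 2.2 the accessor takes a conf argument instead):

```scala
// Sketch only: requires a live spark-shell session with table t1.
val plan = spark.table("t1").queryExecution.optimizedPlan
// sizeInBytes reflects the table size the planner uses,
// e.g. totalSize picked up from the Hive metastore.
println(plan.stats.sizeInBytes)
```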
Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
Column statistics are stored (in a Spark-specific serialized format) in table properties and I really doubt that Impala or Hive could compute the stats and store them in the Spark SQL-compatible format.