How does computing table stats in hive or impala speed up queries in Spark SQL?

Question

For increasing performance (e.g. for joins) it is recommended to compute table statistics first.

In Hive I can do:

analyze table <table name> compute statistics;

In Impala:

compute stats <table name>;

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If yes, which one do I need to run? Do they both save the stats in the Hive metastore? I'm using Spark 1.6.1 on Cloudera 5.5.4.

Note: in the docs of Spark 1.6.1 (https://spark.apache.org/docs/1.6.1/sql-programming-guide.html), for the parameter spark.sql.autoBroadcastJoinThreshold, I found a hint:

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
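
As a side note, here is a minimal sketch of how that threshold interacts with the metastore-recorded table size on Spark 1.6.x; the table names small_dim and big_fact are hypothetical:

// A sketch for Spark 1.6.x in spark-shell; small_dim and big_fact are
// made-up table names. Raise the broadcast threshold to ~50 MB so that
// tables whose metastore statistics report a size below it get broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val joined = sqlContext.table("small_dim").join(sqlContext.table("big_fact"), "id")
joined.explain()  // look for BroadcastHashJoin instead of SortMergeJoin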

Answer

This answer refers to the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?

It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the table metadata in the Hive metastore, where Spark can read them from (and translate them to its own statistics for query planning).
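
A quick way to see what Spark actually derived is to inspect the optimized plan's statistics; a minimal sketch for Spark 2.3+, reusing the table t1 from the example below:

// Inspect the statistics Spark attached to the optimized logical plan of
// table t1 (sizeInBytes is always present; rowCount only if computed).
val plan = spark.table("t1").queryExecution.optimizedPlan
println(plan.stats)  // e.g. Statistics(sizeInBytes=..., rowCount=...)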

You can easily check that using the DESCRIBE EXTENDED SQL command in spark-shell:

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+

ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to the noscan option). If Impala or Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
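
For reference, Spark SQL can run that variant itself; a sketch against the same table t1:

// Record only the table size (no row count) because of NOSCAN.
sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")

// Table-level statistics appear in the "Statistics" row of DESC EXTENDED.
sql("DESC EXTENDED t1").filter($"col_name" === "Statistics").show(truncate = false)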

Use DESC EXTENDED tableName for table-level statistics and see whether you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
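
Note that cost-based optimization is disabled by default in Spark 2.x; a minimal sketch of turning it on:

// Let the optimizer use computed statistics for cost-based decisions.
spark.conf.set("spark.sql.cbo.enabled", true)
// Optionally also reorder multi-way joins based on those statistics.
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)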

Column statistics are stored (in a Spark-specific serialized format) in table properties, and I really doubt that Impala or Hive could compute the stats and store them in the Spark SQL-compatible format.
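
You can see that serialized format for yourself; a sketch that computes column statistics for t1 in Spark and then dumps the table properties (the exact property key names are an assumption, so treat the comment as illustrative):

// Compute column-level statistics in Spark (this is what produced the
// min/max/distinct_count values shown above).
sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")

// The serialized values land in table properties, e.g. under keys like
// spark.sql.statistics.colStats.id.max (assumed naming).
sql("SHOW TBLPROPERTIES t1").show(truncate = false)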
