Compute HIVE statistics in Apache Spark

Question

I'm trying to compute HIVE table statistics from Apache Spark:

`sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS')`

I also executed a statement to see what was collected:

`sqlCtx.sql('DESC FORMATTED t1')`

I can see my stats were collected. However, when I execute the same statement in a HIVE client (Ambari), no statistics are displayed. If the stats are collected by Spark, are they available only to Spark? Does Spark store them somewhere else?

Another question:

I'm also computing stats for all columns in that table:

`sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS c1, c2')`

But when I try to view these stats in Spark, it fails with an unsupported SQL statement exception:

`sqlCtx.sql('DESC FORMATTED t1 c1')`

According to the docs these are valid Hive queries. What is wrong with them?

Thanks for your help.

Answer

Apache Spark stores the statistics as table parameters in the metastore, which is why a plain HIVE client does not show them. To retrieve these stats, connect to the HIVE metastore database and execute a query like the following:

select param_key, param_value 
from table_params tp, tbls t 
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>' 
and param_key like 'spark.sql.stat%';
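As a minimal sketch of what that query returns, the rows are `(param_key, param_value)` pairs with keys such as `spark.sql.statistics.numRows` and `spark.sql.statistics.totalSize` (key names as written by Spark 2.x; the sample values below are made up for illustration):

```python
# Hypothetical sample rows, shaped like the (param_key, param_value)
# pairs returned by the metastore query above; the values are made up.
rows = [
    ("spark.sql.statistics.numRows", "1000"),
    ("spark.sql.statistics.totalSize", "524288"),
]

# Strip the common key prefix and parse the values into integers
# to get a small stats dictionary for the table.
PREFIX = "spark.sql.statistics."
stats = {k[len(PREFIX):]: int(v) for k, v in rows if k.startswith(PREFIX)}

print(stats)  # {'numRows': 1000, 'totalSize': 524288}
```

Note that HIVE itself keeps its own statistics under different parameter keys (e.g. `numRows`, `totalSize` without the `spark.sql.statistics.` prefix), which is why the two tools do not see each other's numbers.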
