Understanding Hive explain plans


Question

Is there a good resource for fully understanding the explain plan that Hive generates? I searched the wiki but could not find a complete guide. The wiki page below briefly explains how EXPLAIN works, but I need more information on how to interpret the plan. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

Recommended answer

I will try to explain a little of what I know.

The execution plan is a description of the tasks required for a query, the order in which they will be executed, and some details about each task. To see the execution plan for a query, prefix the query with the keyword EXPLAIN and run it. Execution plans can be long and complex. Fully understanding them requires a deep knowledge of MapReduce.

Example

EXPLAIN CREATE TABLE flights_by_carrier AS 
SELECT carrier, COUNT(flight) AS num 
FROM flights 
GROUP BY carrier;

This query is a CTAS (CREATE TABLE AS SELECT) statement that creates a new table named flights_by_carrier and populates it with the result of a SELECT query. The SELECT query groups the rows of the flights table by carrier and returns each carrier and the number of flights for that carrier.

Hive's output for the EXPLAIN statement in this example is shown here:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|   Stage-3 depends on stages: Stage-0               |
|   Stage-2 depends on stages: Stage-3               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: flights                         |
|             Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: carrier (type: string), flight (type: smallint) |
|               outputColumnNames: carrier, flight   |
|               Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(flight)        |
|                 keys: carrier (type: string)       |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.TextInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                 name: fly.flights_by_carrier       |
|                                                    |
|   Stage: Stage-0                                   |
|     Move Operator                                  |
|       files:                                       |
|           hdfs directory: true                     |
|           destination: hdfs://localhost:8020/user/hive/warehouse/fly.db/flights_by_carrier |
|                                                    |
|   Stage: Stage-3                                   |
|       Create Table Operator:                       |
|         Create Table                               |
|           columns: carrier string, num bigint      |
|           input format: org.apache.hadoop.mapred.TextInputFormat |
|           output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat |
|           serde name: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           name: fly.flights_by_carrier             |
|                                                    |
|   Stage: Stage-2                                   |
|     Stats-Aggr Operator                            |
|                                                    |
+----------------------------------------------------+--+

Stage dependencies

The example query executes in four stages, Stage-0 to Stage-3. Each stage can be a MapReduce job, an HDFS action, a metastore action, or some other action performed by the Hive server.

The stage numbers do not imply an order of execution or a dependency.

The dependencies between stages determine the order in which they must execute, and Hive lists these dependencies explicitly at the start of the EXPLAIN output, under STAGE DEPENDENCIES.

A root stage, like Stage-1 in this example, has no dependencies and is free to run first.

Non-root stages cannot run until the stages they depend on have completed.
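To make the dependency rule concrete, here is a minimal Python sketch that derives the execution order from the STAGE DEPENDENCIES lines of the example. It is an illustration only, not part of Hive: the function names `parse_dependencies` and `execution_order` are hypothetical, and the parser assumes only the two line shapes that appear in this example ("is a root stage" and "depends on stages:"). Plans with other line shapes would need more handling.

```python
import re

# Dependency lines copied from the EXPLAIN output of the example query.
STAGE_DEPENDENCIES = """\
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-3 depends on stages: Stage-0
Stage-2 depends on stages: Stage-3
"""


def parse_dependencies(text):
    """Map each stage name to the set of stages it depends on."""
    deps = {}
    for line in text.splitlines():
        # Tolerate the "|" table borders that some clients print around each line.
        line = line.strip().strip("|").strip()
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        if root:
            deps[root.group(1)] = set()
            continue
        child = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if child:
            deps[child.group(1)] = {s.strip() for s in child.group(2).split(",")}
    return deps


def execution_order(deps):
    """Return the stages in an order that respects every dependency."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        # A stage is ready once all stages it depends on have completed.
        ready = sorted(s for s, needs in pending.items() if needs <= done)
        if not ready:
            raise ValueError("cyclic or unresolved stage dependencies")
        for stage in ready:
            order.append(stage)
            done.add(stage)
            del pending[stage]
    return order


deps = parse_dependencies(STAGE_DEPENDENCIES)
order = execution_order(deps)
print(" -> ".join(order))  # Stage-1 -> Stage-0 -> Stage-3 -> Stage-2
```

The resulting order starts with the root stage and is driven only by the dependency lines, which shows why the stage numbers themselves say nothing about when a stage runs.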

Stage plans

The STAGE PLANS part of the output describes each stage. For Hive, read the descriptions from the top down.

Stage-1 is identified as a MapReduce job.

The query plan shows that this job includes both a map phase (described by the Map Operator Tree) and a reduce phase (described by the Reduce Operator Tree). In the map phase, the map tasks read the flights table and select the carrier and flight columns. The Group By Operator in the map phase (mode: hash) then computes a partial count per carrier before the Reduce Output Operator emits the data.

This data is passed to the reduce phase, in which the reduce tasks group the data by carrier and aggregate it by counting flights. The Group By Operator in the reduce phase (mode: mergepartial) does this by combining the per-carrier counts already computed on the map side.
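The two Group By Operators in Stage-1 can be pictured with a small Python simulation. This is a sketch of the general two-phase aggregation idea under stated assumptions, not Hive's actual implementation: the function names, the carrier codes, and the way the rows are split are all made up for illustration.

```python
from collections import Counter, defaultdict


def map_side_partial_count(rows):
    """Simulate the map-side Group By (mode: hash): partial COUNT(flight) per carrier."""
    partial = Counter()
    for carrier, flight in rows:
        # COUNT(flight) ignores NULL values, but the carrier still forms a group.
        partial[carrier] += 0 if flight is None else 1
    return dict(partial)


def reduce_side_merge(partials):
    """Simulate the reduce-side Group By (mode: mergepartial): sum the partial counts."""
    merged = defaultdict(int)
    for partial in partials:
        for carrier, count in partial.items():
            merged[carrier] += count
    return dict(merged)


# Hypothetical input splits, one per map task: (carrier, flight) pairs.
split_a = [("AA", 1), ("UA", 7), ("AA", 2)]
split_b = [("AA", 3), ("DL", None), ("UA", 9)]

partials = [map_side_partial_count(split_a), map_side_partial_count(split_b)]
result = reduce_side_merge(partials)
print(result)
```

Each simulated map task sends at most one (carrier, count) pair per carrier instead of one record per flight, which is the point of aggregating before the data moves to the reducers.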

Following Stage-1 is Stage-0, which is an HDFS action (Move).

In this stage, Hive moves the output of the previous stage to a new subdirectory in the warehouse directory in HDFS. This is the storage directory for the new table, which will be named flights_by_carrier.

Following Stage-0 is Stage-3, which is a metastore action: Create Table.

In this stage, Hive creates a new table named flights_by_carrier in the fly database. The table has two columns: a STRING column named carrier and a BIGINT column named num.

The final stage, Stage-2, collects statistics.

The details of this final stage are not important, but it gathers information such as the number of rows in the table, the number of files that store the table data in HDFS, and the number of unique values in each column of the table. These statistics can be used to optimize Hive queries.
