Understanding the Hive explain plan

Question

Is there a good resource for fully understanding the explain plan that Hive generates? I searched the wiki but could not find a complete guide. The wiki page below briefly explains how EXPLAIN works, but I need more information on how to interpret the plan. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

Recommended answer

I will try to explain a little of what I know.

The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task. To see the execution plan for a query, prefix the query with the keyword EXPLAIN and run it. Execution plans can be long and complex, and fully understanding them requires a deep knowledge of MapReduce.

Example

EXPLAIN CREATE TABLE flights_by_carrier AS 
SELECT carrier, COUNT(flight) AS num 
FROM flights 
GROUP BY carrier;

This query is a CTAS statement that creates a new table named flights_by_carrier and populates it with the result of a SELECT query. The SELECT query groups the rows of the flights table by carrier and returns each carrier and the number of flights for that carrier.

Hive's output for the example EXPLAIN statement is shown here:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|   Stage-3 depends on stages: Stage-0               |
|   Stage-2 depends on stages: Stage-3               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: flights                         |
|             Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: carrier (type: string), flight (type: smallint) |
|               outputColumnNames: carrier, flight   |
|               Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(flight)        |
|                 keys: carrier (type: string)       |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.TextInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                 name: fly.flights_by_carrier       |
|                                                    |
|   Stage: Stage-0                                   |
|     Move Operator                                  |
|       files:                                       |
|           hdfs directory: true                     |
|           destination: hdfs://localhost:8020/user/hive/warehouse/fly.db/flights_by_carrier |
|                                                    |
|   Stage: Stage-3                                   |
|       Create Table Operator:                       |
|         Create Table                               |
|           columns: carrier string, num bigint      |
|           input format: org.apache.hadoop.mapred.TextInputFormat |
|           output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat |
|           serde name: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           name: fly.flights_by_carrier             |
|                                                    |
|   Stage: Stage-2                                   |
|     Stats-Aggr Operator                            |
|                                                    |
+----------------------------------------------------+--+

Stage dependencies

The example query will execute in four stages, Stage-0 to Stage-3. Each stage could be a MapReduce job, an HDFS action, a metastore action, or some other action performed by the Hive server.

The numbering does not imply an order of execution or dependency.

The dependencies between stages determine the order in which they must execute, and Hive specifies these dependencies explicitly at the start of the EXPLAIN results.

A root stage, like Stage-1 in this example, has no dependencies and is free to run first.

Non-root stages cannot run until the stages upon which they depend have completed.
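
The ordering rule above can be sketched in a few lines of Python. This is only an illustration and not part of Hive: the helper name execution_order is hypothetical, and it assumes the STAGE DEPENDENCIES lines have been copied out of the EXPLAIN output with the table borders removed. It relies on graphlib.TopologicalSorter from the standard library (Python 3.9 or later).

```python
import re
from graphlib import TopologicalSorter

# STAGE DEPENDENCIES lines copied from the EXPLAIN output above (borders removed).
DEPENDENCIES = """\
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-3 depends on stages: Stage-0
Stage-2 depends on stages: Stage-3
"""

def execution_order(text):
    """Return one valid execution order for the stages (hypothetical helper)."""
    graph = {}
    for line in text.splitlines():
        line = line.strip()
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        if root:
            # A root stage has no prerequisites.
            graph[root.group(1)] = set()
            continue
        dep = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if dep:
            graph[dep.group(1)] = {s.strip() for s in dep.group(2).split(",")}
    # static_order() yields a stage only after every stage it depends on.
    return list(TopologicalSorter(graph).static_order())

order = execution_order(DEPENDENCIES)
print(" -> ".join(order))  # Stage-1 -> Stage-0 -> Stage-3 -> Stage-2
```

The resulting order follows the dependencies, not the stage numbers, which is why Stage-0 runs second and Stage-2 runs last.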

Stage plans

The stage plans part of the output shows descriptions of the stages. For Hive, read them by starting at the top and then going down.
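
As a rough aid for reading this section top to bottom, the sketch below lists each stage with the first line of its plan, which names the kind of work the stage does. It is an illustration under assumptions: the helper stage_kinds is hypothetical, and the input is a trimmed copy of the STAGE PLANS text above with the table borders and operator details removed.

```python
import re

# Trimmed copy of the STAGE PLANS section above (borders and details removed).
STAGE_PLANS = """\
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
  Stage: Stage-0
    Move Operator
  Stage: Stage-3
      Create Table Operator:
  Stage: Stage-2
    Stats-Aggr Operator
"""

def stage_kinds(text):
    """Map each stage name to the first line of its plan (hypothetical helper)."""
    kinds = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        header = re.fullmatch(r"Stage: (Stage-\d+)", stripped)
        if header:
            current = header.group(1)
        elif current is not None and current not in kinds:
            # Only the first line after the stage header is recorded.
            kinds[current] = stripped.rstrip(":")
    return kinds

kinds = stage_kinds(STAGE_PLANS)
for stage, kind in kinds.items():
    print(f"{stage}: {kind}")
```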

Stage-1 is identified as a MapReduce job.

The query plan shows that this job includes both a map phase (described by the Map Operator Tree) and a reduce phase (described by the Reduce Operator Tree). In the map phase, the map tasks read the flights table and select the carrier and flight columns.

This data is passed to the reduce phase, in which the reduce tasks group the data by carrier and aggregate it by counting flights.
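
In the plan above, a Group By Operator appears in both trees: with mode: hash in the Map Operator Tree and with mode: mergepartial in the Reduce Operator Tree. This indicates that counting starts on the map side as partial counts per carrier, and that the reduce side merges those partial counts. The Python sketch below imitates that flow on made-up data. The rows, the split into two map tasks, and the function names are assumptions for illustration only, and this is not how Hive itself is implemented.

```python
from collections import Counter, defaultdict

# Made-up (carrier, flight) rows, split across two hypothetical map tasks.
SPLITS = [
    [("AA", 1), ("DL", 7), ("AA", 2)],
    [("DL", 8), ("AA", 3), ("UA", None)],
]

def map_side(rows):
    """Partial aggregation (like mode: hash): count non-NULL flights per carrier."""
    partial = Counter()
    for carrier, flight in rows:
        partial[carrier] += 0 if flight is None else 1
    return partial

def shuffle(partials):
    """Group partial counts by key, as partitioning on _col0 (carrier) would."""
    grouped = defaultdict(list)
    for partial in partials:
        for carrier, count in partial.items():
            grouped[carrier].append(count)
    return grouped

def reduce_side(grouped):
    """Merge the partial counts (like mode: mergepartial), in key order."""
    return {carrier: sum(counts) for carrier, counts in sorted(grouped.items())}

result = reduce_side(shuffle(map_side(rows) for rows in SPLITS))
print(result)  # {'AA': 3, 'DL': 2, 'UA': 0}
```

Each map task emits one partial count per carrier instead of one record per row, which corresponds to the single _col1 (type: bigint) value expression in the Reduce Output Operator.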

Following Stage-1 is Stage-0, which is an HDFS action (Move).

In this stage, Hive moves the output of the previous stage to a new subdirectory in the warehouse directory in HDFS. This is the storage directory for the new table that will be named flights_by_carrier.

Following Stage-0 is Stage-3, which is a metastore action: Create Table.

In this stage, Hive creates a new table named flights_by_carrier in the fly database. The table has two columns: a STRING column named carrier and a BIGINT column named num.

The final stage, Stage-2, collects statistics.

The details of this final stage are not important, but it gathers information such as the number of rows in the table, the number of files that store the table data in HDFS, and the number of unique values in each column in the table. These statistics can be used to optimize Hive queries.
