How are Hive SQL queries submitted as MR jobs from the Hive CLI?


Problem description

I have deployed a CDH-5.9 cluster with MR as the Hive execution engine. I have a Hive table named "users" with 50 rows. Whenever I execute the query select * from users, it works fine, as follows:

hive> select * from users;
OK

Adam       1       38     ATK093   CHEF
Benjamin   2       24     ATK032   SERVANT
Charles    3       45     ATK107   CASHIER
Ivy        4       30     ATK384   SERVANT
Linda      5       23     ATK132   ASSISTANT 
. 
.
.

Time taken: 0.059 seconds, Fetched: 50 row(s)

But issuing select max(age) from users fails after it is submitted as an MR job. The container log doesn't contain any information that would explain why it is failing.

      hive> select max(age) from users;
        Query ID = canballuser_20170808020101_5ed7c6b7-097f-4f5f-af68-486b45d7d4e
        Total jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks determined at compile time: 1
        In order to change the average load for a reducer (in bytes):
        set hive.exec.reducers.bytes.per.reducer=<number>
        In order to limit the maximum number of reducers:
        set hive.exec.reducers.max=<number>
        In order to set a constant number of reducers:
        set mapreduce.job.reduces=<number>
        Starting Job = job_1501851520242_0010, Tracking URL = http://hadoop-master:8088/proxy/application_1501851520242_0010/
        Kill Command = /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop/bin/hadoop job  -kill job_1501851520242_0010
        Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
        2017-08-08 02:01:11,472 Stage-1 map = 0%,  reduce = 0%
        Ended Job = job_1501851520242_0010 with errors
        Error during job, obtaining debugging information...
        FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
        MapReduce Jobs Launched:
        Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
        Total MapReduce CPU Time Spent: 0 msec

If I could get the workflow of how a Hive query is executed from the Hive CLI, it might help me debug the issue further.

Solution

There are a lot of components involved in the Hive query execution flow. The high-level architecture is explained here: https://cwiki.apache.org/confluence/display/Hive/Design

That document links to more detailed documentation for the individual components.

Typical query execution flow (High Level)

  1. The UI calls the execute interface to the Driver.
  2. The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  3. The compiler gets the necessary metadata from the metastore. This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates.
  4. The plan generated by the compiler is a DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers); a way to inspect this DAG is sketched just after this list.
  5. The execution engine submits these stages to the appropriate components. In each task (mapper/reducer), the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in HDFS).
  6. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.
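
To see this stage DAG for a concrete query, you can run EXPLAIN from the Hive CLI. The sketch below uses the max(age) query from the question; the exact output layout depends on the Hive version, but Stage-1 should appear as a map/reduce stage:

hive> EXPLAIN select max(age) from users;
hive> EXPLAIN EXTENDED select max(age) from users;

In the output you should see a map operator tree (a TableScan over users feeding a partial max aggregation) and a reduce operator tree computing the final max, which is the structure described in steps 4 and 5 above. EXPLAIN EXTENDED additionally prints lower-level details such as file paths and serde information used in step 5.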

The Hive documentation root is here: https://cwiki.apache.org/confluence/display/Hive/Home where you can find more details about the different components. You can also study the sources for more details about particular class implementations.

Hadoop JobTracker docs: https://wiki.apache.org/hadoop/JobTracker
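
A practical note on debugging the failure itself: the job reports 0 mappers and 0 reducers, which usually means it failed before any task launched, so the container logs stay empty. In that case the aggregated YARN application log (if log aggregation is enabled) and a DEBUG-level CLI run are often more informative. A minimal sketch, using the application id from the output above:

# Fetch the aggregated YARN logs for the failed application
# (the id comes from the "Starting Job" line in the question).
yarn logs -applicationId application_1501851520242_0010

# Re-run the query with Hive CLI logging at DEBUG level on the console.
hive --hiveconf hive.root.logger=DEBUG,console -e "select max(age) from users"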
