How are Hive SQL queries submitted as MR jobs from the Hive CLI

Problem Description

I have deployed a CDH-5.9 cluster with MR as the Hive execution engine. I have a Hive table named "users" with 50 rows. Whenever I execute the query select * from users, it works fine, as follows:

hive> select * from users;
OK

Adam       1       38     ATK093   CHEF
Benjamin   2       24     ATK032   SERVANT
Charles    3       45     ATK107   CASHIER
Ivy        4       30     ATK384   SERVANT
Linda      5       23     ATK132   ASSISTANT 
. 
.
.

Time taken: 0.059 seconds, Fetched: 50 row(s)
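
Note that a plain select * is normally answered by Hive's local fetch task and never launches a MapReduce job, which is why it can succeed even while MR jobs fail. A minimal way to verify this, assuming the standard hive.fetch.task.conversion setting applies on this Hive version:

hive> set hive.fetch.task.conversion;        -- print the current value ("more" or "minimal" by default)
hive> set hive.fetch.task.conversion=none;   -- force even select * through MapReduce, to reproduce the failure
hive> select * from users;                   -- with "none", this should now fail the same way the aggregate does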

But select max(age) from users fails after being submitted as an MR job, and the container logs contain no information that explains why it failed:

hive> select max(age) from users;
Query ID = canballuser_20170808020101_5ed7c6b7-097f-4f5f-af68-486b45d7d4e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501851520242_0010, Tracking URL = http://hadoop-master:8088/proxy/application_1501851520242_0010/
Kill Command = /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop/bin/hadoop job  -kill job_1501851520242_0010
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2017-08-08 02:01:11,472 Stage-1 map = 0%,  reduce = 0%
Ended Job = job_1501851520242_0010 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
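
When the job's container logs are empty, the aggregated YARN application logs and a more verbose Hive console are usually the next places to look. A minimal sketch, assuming YARN log aggregation is enabled on the cluster (the application ID comes from the Tracking URL above):

# Fetch all aggregated container logs for the failed application
yarn logs -applicationId application_1501851520242_0010

# Re-run the Hive CLI with DEBUG output on the console to watch the client side of job submission
hive --hiveconf hive.root.logger=DEBUG,console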

If I could get the workflow of Hive query execution from the Hive CLI, it might help me debug the issue further.

Recommended Answer

There are a lot of components involved in the Hive query execution flow. The high-level architecture is explained here: https://cwiki.apache.org/confluence/display/Hive/Design

That document links to more detailed documentation for each component.

Typical query execution flow (High Level)

  1. The UI calls the execute interface of the Driver.
  2. The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  3. The compiler gets the necessary metadata from the Metastore. This metadata is used to type-check the expressions in the query tree and to prune partitions based on query predicates.
  4. The plan generated by the compiler is a DAG of stages, where each stage is either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees executed on the mappers) and a reduce operator tree (for operations that need reducers); see the EXPLAIN sketch after this list.
  5. The execution engine submits these stages to the appropriate components. In each task (mapper/reducer), the deserializer associated with the table or intermediate output is used to read rows from HDFS files, and these rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper if the operation needs no reduce). The temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location; this scheme ensures that dirty data is not read (a file rename is an atomic operation in HDFS).
  6. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.
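
To see this DAG of stages for a concrete query, Hive can print the compiled plan with the EXPLAIN statement. A minimal sketch using the failing query from the question (the exact output varies between Hive versions):

hive> explain select max(age) from users;            -- stage dependencies plus the map- and reduce-side operator trees
hive> explain extended select max(age) from users;   -- the same plan with physical details (paths, SerDes, and so on)

For select max(age) from users this should show a single map/reduce stage (a TableScan and a map-side GroupByOperator feeding a reduce-side aggregation) followed by a fetch stage, matching steps 4-6 above.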

The Hive documentation root is here: https://cwiki.apache.org/confluence/display/Hive/Home, where you can find more details about the different components. You can also study the sources for more details about specific class implementations.

Hadoop JobTracker docs: https://wiki.apache.org/hadoop/JobTracker
