How to make Hive run MapReduce jobs concurrently?


Question

I'm new to Hive and have run into a problem.

I have a table in Hive like this:

create table td(id int, time string, ip string, v1 bigint, v2 int, v3 int,
v4 int, v5 bigint, v6 int)  PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','  lines TERMINATED BY '\n';

And I run a query like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1), count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1), count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1), count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1), count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1), count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1), count(distinct v1) where v4=1 or v4=5 group by v3

It works, and the output is what I want.

But there is one problem: Hive generates 9 MapReduce jobs and runs them one by one.

I ran EXPLAIN on this query and got the following output:

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17

It seems that stages 9-17 correspond to MapReduce jobs 0-8.
But according to the EXPLAIN output above, stages 10-17 depend only on stage 9,
so my question is: why can't jobs 1-8 run concurrently?

Or how can I make jobs 1-8 run concurrently?

Thank you very much for your help!

Recommended answer

In hive-default.xml, there is a property named "hive.exec.parallel" that enables jobs to be executed in parallel. Its default value is "false"; change it to "true" to turn this on. Another property, "hive.exec.parallel.thread.number", controls the maximum number of jobs that can be executed in parallel.
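As a minimal sketch, the two properties named in the answer can also be set for the current session with SET statements before running the multi-insert query, instead of editing a configuration file. The thread count of 8 below is only an illustrative value, not a recommendation.

```sql
-- Allow independent stages of a query to be submitted in parallel
-- (applies to the current Hive session only).
SET hive.exec.parallel=true;

-- Upper bound on how many jobs may run at the same time.
-- 8 is an illustrative value; tune it to the cluster's capacity.
SET hive.exec.parallel.thread.number=8;
```

With these set, stages that depend only on an already finished stage (such as Stage-10 through Stage-17 in the EXPLAIN output above) become eligible to run at the same time, subject to available cluster resources.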

More details: https://issues.apache.org/jira/browse/HIVE-549
