How many mappers and reducers will get created for a partitioned table in Hive?


Problem description


I am always confused about how many mappers and reducers get created for a particular task in Hive. For example, suppose the block size is 128 MB and there are 365 files, each mapping to a date in a year (file size = 1 MB each), and the table is partitioned on the date column. In this case, how many mappers and reducers will run while loading the data?

Solution

Mappers:

The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.
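A quick way to check which engine and input format your session is using (a minimal sketch, assuming the Hive CLI or Beeline, where set <property> with no value prints the current setting):

set hive.execution.engine;  -- prints mr or tez
set hive.input.format;      -- typically org.apache.hadoop.hive.ql.io.CombineHiveInputFormat on MR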

Tez:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB

Also, mappers run on the data nodes where the data is located, which is why manually controlling the number of mappers is not easy; it is not always possible to combine the input.
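As a rough illustration for the scenario in the question (365 files of 1 MB each, about 365 MB in total), assuming Tez with the default grouping settings shown above:

-- total input: 365 x 1 MB = 365 MB
-- with tez.grouping.min-size = 16 MB, Tez packs the small files into
-- grouped splits of at least 16 MB each:
-- 365 MB / 16 MB ~= 23 grouped splits, so on the order of 20-23 mappers
-- (the exact count also depends on locality and available cluster parallelism)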

Reducers: Controlling the number of reducers is much easier. The number of reducers is determined by the following properties (see the sketch after the list):

mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.

hive.exec.reducers.bytes.per.reducer - The size per reducer. The default prior to Hive 0.14.0 is 1 GB; in Hive 0.14.0 and later it is 256 MB.

hive.exec.reducers.max - The maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive uses this as the upper bound when automatically determining the number of reducers.
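Putting these together, a minimal sketch of the relevant settings (the values shown are the usual defaults for recent Hive versions; your installation may differ):

set mapreduce.job.reduces=-1;                        -- let Hive estimate the count
set hive.exec.reducers.bytes.per.reducer=256000000;  -- ~256 MB per reducer (Hive 0.14.0+ default)
set hive.exec.reducers.max=1009;                     -- cap on the auto-computed count

With mapreduce.job.reduces=-1, Hive estimates roughly: reducers = min(hive.exec.reducers.max, ceil(total input bytes / hive.exec.reducers.bytes.per.reducer)).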

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
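For example, for roughly 365 MB of input (the scenario in the question), a hedged back-of-the-envelope estimate using the formula above:

set hive.exec.reducers.bytes.per.reducer=256000000;  -- default: ceil(365 MB / 256 MB) = 2 reducers
set hive.exec.reducers.bytes.per.reducer=67108864;   -- 64 MB: ceil(365 MB / 64 MB) = 6 reducers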

