Hadoop - How does the reducer get its data?


Question



I understand that the mapper produces one partition per reducer. How does a reducer know which partition to copy? Say there are 2 nodes running the mapper for a word-count program, and 2 reducers are configured. If each map node produces 2 partitions, possibly with partitions on both nodes containing the same word as a key, how will the reducers work correctly?

For example:

If node 1 produces partition 1 and partition 2, and partition 1 contains a key named "WHO".

If node 2 produces partition 3 and partition 4, and partition 3 contains a key named "WHO".

If partition 1 and partition 4 went to reducer 1 (and the remaining partitions to reducer 2), how does reducer 1 compute the correct word count?

If this is not a possibility, and partitions 1 and 3 are made to go to reducer 1, how does Hadoop do this? Does it make sure that key-value pairs with a given key always go to the same reducer, even when they come from different nodes? If so, how does it do this?

Thanks, Suresh.

Solution

In your situation, since partition 1 and partition 3 both contain the key 'WHO', it is guaranteed that both partitions go to the same reducer.

Update

In Hadoop, the maximum number of reduce tasks running on a tasktracker at any one time is determined by the mapred.tasktracker.reduce.tasks.maximum property.
The number of reducers for a MapReduce job is set via -D mapred.reduce.tasks=n.

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well. (Hadoop: The Definitive Guide)

So, all values with a given key always go to the same reducer.
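The guarantee comes from the fact that the partition number is computed from the key alone. As a minimal sketch (the class name and method are illustrative, but the formula mirrors Hadoop's default HashPartitioner, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`):

```java
// Sketch of Hadoop's default partitioning logic (HashPartitioner).
// Because the partition number depends only on the key, the key "WHO"
// gets the same partition number on every map node, so every reducer
// fetching that partition number collects all of "WHO"'s values.
public class PartitionDemo {
    // Mask off the sign bit so the result is non-negative,
    // then take the hash modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // "WHO" emitted on node 1 and on node 2: same partition number,
        // so both map outputs are pulled by the same reducer.
        int onNode1 = partitionFor("WHO", 2);
        int onNode2 = partitionFor("WHO", 2);
        System.out.println("WHO -> partition " + onNode1
                + " on both nodes: " + (onNode1 == onNode2));
    }
}
```

In your example this means partition 1 (node 1) and partition 3 (node 2) are really the same partition number computed on two different nodes, and reducer 1 copies that partition number from both map nodes before sorting and reducing.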

