Hadoop one Map and multiple Reduce


Question

We have a large dataset to analyze with multiple reduce functions.

All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset is too costly to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.

Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.

Answer

Are you expecting every reducer to work on exactly the same mapped data? At a minimum the "key" should differ, since the key decides which reducer a record goes to.

You can write the output multiple times in the mapper, emitting `<$i, $key>` as the key (where $i is the index of the i-th reducer and $key is your original key). You then need a custom "Partitioner" to make sure these n copies are distributed across reducers based on $i, and a "GroupingComparator" to group records by the original $key.
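This fan-out technique can be sketched in plain Java, without the Hadoop runtime. Everything here (the `FanOutSketch` class, the `map`/`partition`/`groupKey` methods, the pipe-delimited key encoding, the record data) is illustrative, not Hadoop API: each input record is emitted once per reducer with a composite key tagged by the reducer index $i, a partition function routes on $i, and records within each partition are grouped by the original $key.

```java
import java.util.*;

public class FanOutSketch {
    static final int NUM_REDUCERS = 2; // one per reduce algorithm

    // "Map" step: emit each record once per reducer, tagged with the reducer index.
    static List<Map.Entry<String, String>> map(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (int i = 0; i < NUM_REDUCERS; i++) {
            // composite key <$i, $key>, encoded here as "i|originalKey"
            out.add(Map.entry(i + "|" + key, value));
        }
        return out;
    }

    // "Partitioner": route on the $i prefix only, so reducer i receives every copy tagged i.
    static int partition(String compositeKey) {
        return Integer.parseInt(compositeKey.split("\\|", 2)[0]) % NUM_REDUCERS;
    }

    // "GroupingComparator": group by the original $key, ignoring the $i prefix.
    static String groupKey(String compositeKey) {
        return compositeKey.split("\\|", 2)[1];
    }

    public static void main(String[] args) {
        String[][] records = { {"a", "1"}, {"b", "2"}, {"a", "3"} };

        // Simulated shuffle: bucket mapped records by partition, then group by original key.
        List<Map<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < NUM_REDUCERS; i++) partitions.add(new TreeMap<>());
        for (String[] r : records)
            for (Map.Entry<String, String> kv : map(r[0], r[1]))
                partitions.get(partition(kv.getKey()))
                          .computeIfAbsent(groupKey(kv.getKey()), k -> new ArrayList<>())
                          .add(kv.getValue());

        // Each "reducer" now sees the same grouped data and can run its own algorithm.
        System.out.println(partitions.get(0));
        System.out.println(partitions.get(1));
    }
}
```

Both partitions end up with identical groups, which is the point: the map phase runs once, and every reducer gets its own full copy of the grouped data. In real Hadoop the same roles are played by a `Partitioner` subclass and the job's grouping comparator.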

It's possible to do, but not in a trivial way within one MR job.

