Data sharing in Hadoop MapReduce chaining


Question

Is it possible to share a value between a reducer and the mapper that follows it?

Or is it possible to keep the output of the first reducer in memory so that the second mapper can access it from there?

The problem is that I have written a chained MapReduce pipeline of the form Map1 -> Reduce1 -> Map2 -> Reduce2.

Map1 and Map2 read the same input file.

Reduce1 derives a value, say 'X', as its output.

Map2 needs both 'X' and the input file.


How can we do this without reading the output file of Reduce1?

Is it possible to store 'X' in memory so that Map2 can access it?

Recommended answer

Each job is independent of the others, so it is not possible to share data across jobs without storing the output in an intermediate location.
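To make the point about the intermediate location concrete, here is a minimal sketch in plain Python. It only simulates the two chained jobs and does not use the Hadoop API. The function names, the file names, and the choice of "X = maximum of the input" are all hypothetical, picked purely for illustration. The first job persists X to a file, and the second job reads that file once before mapping over the same input.

```python
import os
import tempfile


def run_job1(input_path, x_path):
    """Job 1 (Map1 + Reduce1): derive X and persist it to the intermediate location."""
    with open(input_path) as f:
        # Map1 parses each record; Reduce1 aggregates them.
        # Assumption: X is the maximum value, chosen only for illustration.
        x = max(int(line) for line in f if line.strip())
    with open(x_path, "w") as out:
        out.write(str(x))


def run_job2(input_path, x_path):
    """Job 2 (Map2 + Reduce2): load X once, then process the same input file."""
    # Setup phase: read X from the intermediate location before any record is mapped.
    with open(x_path) as f:
        x = int(f.read())
    # Map2: re-read the original input and combine every record with X.
    with open(input_path) as f:
        diffs = [x - int(line) for line in f if line.strip()]
    # Reduce2: aggregate the mapped values (here, a plain sum).
    return sum(diffs)


def run_chain(values):
    """Driver: run job 1, then job 2, sharing X only through a file."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.txt")
        x_path = os.path.join(workdir, "x.txt")
        with open(input_path, "w") as f:
            f.write("\n".join(str(v) for v in values))
        run_job1(input_path, x_path)
        return run_job2(input_path, x_path)
```

For example, `run_chain([3, 7, 5])` derives X = 7 in the first job and returns 6 from the second. In a real Hadoop driver, a small value like X is often handed to the second job through the job `Configuration` after the driver has read it from the first job's output. That still goes through the intermediate output rather than shared memory.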

FYI, in the MapReduce model the map tasks don't talk to each other, and the same is true of the reduce tasks. Apache Giraph, which runs on Hadoop, does use communication between the mappers of a single job for iterative algorithms. Without that communication, such algorithms would require the same job to be run again and again.

I'm not sure which algorithm is being implemented or why MR was chosen, but every MR algorithm can also be implemented in BSP (Bulk Synchronous Parallel). There is a paper comparing BSP with MR. Some algorithms perform better in BSP than in MR. Apache Hama is an implementation of the BSP model, in the same way that Apache Hadoop is an implementation of MR.
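As a rough illustration of why BSP can avoid the second job, the sketch below simulates two supersteps in plain Python. It is not the Apache Hama API. The function name `bsp_run`, the in-memory inboxes, and the use of a maximum as 'X' are assumptions made for the example. The peers run sequentially here, so the barrier between supersteps is implied by the order of the loops.

```python
def bsp_run(partitions):
    """Simulate two BSP supersteps over a list of per-peer data partitions."""
    # One inbox per peer; messages sent in superstep 1 are read in superstep 2.
    inboxes = [[] for _ in partitions]

    # Superstep 1: each peer computes a local value and sends it to every peer.
    for part in partitions:
        local_max = max(part)
        for inbox in inboxes:
            inbox.append(local_max)

    # Barrier: all messages are delivered before the next superstep starts.

    # Superstep 2: each peer derives the global X from its inbox and applies it
    # to its own partition, without writing X to an intermediate file.
    results = []
    for part, inbox in zip(partitions, inboxes):
        x = max(inbox)
        results.append(sum(x - v for v in part))
    return results
```

With `bsp_run([[3, 7], [5]])`, both peers agree on X = 7 after the first superstep and return `[4, 2]`. The per-peer results add up to the same total an equivalent two-job MapReduce chain would produce.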
