Mapping through two data sets with Hadoop


Problem Description


    Suppose I have two key-value data sets--Data Sets A and B, let's call them. I want to update all the data in Set A with data from Set B where the two match on keys.

    Because I'm dealing with such large quantities of data, I'm using Hadoop to MapReduce. My concern is that to do this key matching between A and B, I need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient.

    Would there be a recommended way to do this that doesn't require repeating the work of loading in A every time?

    Some pseudocode to clarify what I'm currently doing:

    Load in Data Set A # This seems like the expensive step to always be doing
    Foreach key/value in Data Set B:
       If key is in Data Set A:
          Update Data Set A
    

    Solution

    According to the documentation, the MapReduce framework includes the following steps:

    1. Map
    2. Sort/Partition
    3. Combine (optional)
    4. Reduce

    You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.

    Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
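The sort/partition behavior described above can be simulated in plain Python. This is only a sketch of what the framework does between Map and Reduce, not Hadoop code; the sample keys, values, and the `"A"`/`"B"` source tags are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tagged map output: (key, (source_tag, value)) pairs
# emitted by the Mappers reading Set A and Set B.
map_output = [
    ("k2", ("A", "old2")),
    ("k1", ("A", "old1")),
    ("k1", ("B", "new1")),
    ("k3", ("B", "new3")),
]

# Step (2): the framework sorts map output by key. After sorting,
# a single groupby pass yields every value for each unique key --
# exactly the (key, values) groups a Reducer receives.
map_output.sort(key=itemgetter(0))
grouped = {key: [tagged for _, tagged in pairs]
           for key, pairs in groupby(map_output, key=itemgetter(0))}
# grouped["k1"] now holds both Set A's and Set B's values for k1.
```

Because the framework does this sorting and grouping for you, no single task ever needs the whole of Set A in memory; each Reducer sees only the values for its own partition of the key space.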

    To finish your join, the Reducer needs only to output the key with the updated value from Set B if one exists; otherwise, it outputs the key with the original value from Set A. To distinguish values from Set A and Set B, set a flag on each value the Mapper emits.
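The per-key Reducer logic might look like the following Python sketch. The `reduce_join` function and the `"A"`/`"B"` tags are illustrative assumptions; in a real Hadoop job this would be the body of a `Reducer.reduce` method operating on the framework's grouped values:

```python
def reduce_join(key, tagged_values):
    """Reduce-side join for one key: prefer Set B's value when present,
    otherwise fall back to Set A's original value.

    tagged_values: iterable of (tag, value) pairs, tag "A" or "B".
    """
    a_value = b_value = None
    for tag, value in tagged_values:
        if tag == "B":
            b_value = value
        else:
            a_value = value
    # Emit the updated value from Set B if it exists, else A's original.
    return (key, b_value if b_value is not None else a_value)
```

For example, `reduce_join("k1", [("A", "old1"), ("B", "new1")])` returns `("k1", "new1")`, while a key present only in Set A passes through unchanged.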
