Map Reduce: How to partition the records of two data sets into blocks, and how to pair up those blocks

Problem description

I want to create a Map function that performs the following operations:

Step 1:

I have two data sets, R and S. I want to partition each data set into n equal-sized blocks, which can be done by putting every |R|/n records of R (and every |S|/n records of S) into one block.

After that:

Step 2: Every possible pair of blocks (one from R and one from S) is then assigned to a bucket at the end of the Map phase, so that the Reduce function can take each bucket as input, with some id as the key for each pair of values. For example:

<id:(Sij,Ril)>
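
One possible way to picture steps 1 and 2 is the plain-Java sketch below. It is not Hadoop code and not a confirmed solution to the question: the class name BlockPairKeys, the methods blockOf and keysFor, and the key format "R<i>-S<l>" are all hypothetical. It also assumes that each record's position (index) and the total record count of its data set are known, which a real mapper does not get for free. The idea is that a record in block i of R is emitted once for every block of S, and a record in block l of S once for every block of R, so that the reducer for key (i, l) receives exactly one block from each data set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes block numbers and pair keys.
class BlockPairKeys {

    // Step 1: block number in [0, n) for the record at position `index`
    // (0-based) in a data set of `total` records. Block sizes differ by at most one.
    static int blockOf(long index, long total, int n) {
        return (int) (index * n / total);
    }

    // Step 2: all bucket keys under which a record has to be emitted.
    // A record from R block i goes to keys (i, 0) .. (i, n-1);
    // a record from S block l goes to keys (0, l) .. (n-1, l).
    static List<String> keysFor(char dataset, int block, int n) {
        List<String> keys = new ArrayList<>();
        for (int other = 0; other < n; other++) {
            int rBlock = (dataset == 'R') ? block : other;
            int sBlock = (dataset == 'R') ? other : block;
            keys.add("R" + rBlock + "-S" + sBlock);
        }
        return keys;
    }

    public static void main(String[] args) {
        int n = 3;
        // Record 4 of 10 in R, and record 9 of 10 in S.
        System.out.println(keysFor('R', blockOf(4, 10, n), n));
        System.out.println(keysFor('S', blockOf(9, 10, n), n));
    }
}
```

In a real job the map function would emit the record once per returned key, which replicates every record n times. If record positions are not available, assigning blocks by a hash of the record modulo n is a common substitute, at the cost of only approximately equal block sizes.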





So my questions are:

1) Is there an existing function that I can use for step 1? How can I implement this operation separately for each data set?

2) How can I refer to each data set specifically in step 2, so that I can take one block from R and one from S?

Note: In main I define the two data sets like this:

FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
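
Because both paths feed the same job, one way to tell R records from S records inside the map function is to compare the path of the file being read against the two input paths. The sketch below shows only that comparison, in plain Java. The class InputTagger and its methods are hypothetical, and it assumes the current file's path has already been obtained as a string in the same form as args[0] and args[1] (Hadoop exposes the file of the current split to the mapper; alternatively, Hadoop's MultipleInputs class can bind a separate mapper class to each input path, which avoids the comparison entirely).

```java
// Hypothetical helper, not part of Hadoop: tags a record by its input path.
class InputTagger {

    // True if `file` is `dir` itself or lies somewhere below it.
    static boolean isUnder(String file, String dir) {
        String d = dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
        return file.equals(d) || file.startsWith(d + "/");
    }

    // Returns 'R' or 'S' depending on which input path the file belongs to.
    static char datasetOf(String inputFile, String rPath, String sPath) {
        if (isUnder(inputFile, rPath)) return 'R';
        if (isUnder(inputFile, sPath)) return 'S';
        throw new IllegalArgumentException("Unknown input: " + inputFile);
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        System.out.println(datasetOf("/data/R/part-00000", "/data/R", "/data/S")); // prints R
        System.out.println(datasetOf("/data/S/part-00003", "/data/R", "/data/S")); // prints S
    }
}
```

The resulting tag can then be carried along with each emitted value, so the reducer can separate the R records from the S records within a bucket.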

Recommended answer
