Map Reduce: How can I partition the records of two data sets into blocks, and how can I pair these blocks?
Problem description
I want to create a Map function that performs the following operations:
Step 1:
I have two data sets, R and S. I want to partition each of them into n equal-sized blocks, which can be done by putting every R/n records of R (and every S/n records of S) into one block.
After that:
Step 2: Every possible pair of blocks (one from R and one from S) is assigned to a bucket at the end of the Map phase, so that the Reduce function can take it as input, with some id as the key for each pair of values. For example:
<id:(Sij,Ril)>
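One possible way to realize the two steps, shown here as a minimal plain-Java sketch without any Hadoop dependency: each record is assigned to a block from its position, and it is then emitted once for every bucket that pairs its block with a block of the other data set. The class name `BlockPairSketch`, the helpers `blockOf` and `bucketsFor`, and the `i_j` key format are hypothetical names chosen for illustration. The sketch also assumes that each record's position and the data set's total size are known, which a real mapper does not get for free; hashing or random block assignment is a common substitute.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPairSketch {

    /** Block index of the record at position idx in a data set of `total` records split into n blocks. */
    public static int blockOf(long idx, long total, int n) {
        if (total <= 0 || n <= 0 || idx < 0 || idx >= total) {
            throw new IllegalArgumentException("invalid idx/total/n");
        }
        long blockSize = (total + n - 1) / n; // ceil(total / n) records per block
        return (int) (idx / blockSize);
    }

    /**
     * Bucket ids "i_j" (i = block of R, j = block of S) that a record must be sent to.
     * A record of R in block i goes to i_0 .. i_(n-1); a record of S in block j goes to 0_j .. (n-1)_j.
     */
    public static List<String> bucketsFor(boolean fromR, int block, int n) {
        List<String> keys = new ArrayList<>();
        for (int other = 0; other < n; other++) {
            keys.add(fromR ? block + "_" + other : other + "_" + block);
        }
        return keys;
    }

    public static void main(String[] args) {
        int n = 3;
        // Fifth record (idx 4) of a 9-record R lands in block 1.
        System.out.println(bucketsFor(true, blockOf(4, 9, n), n));  // prints [1_0, 1_1, 1_2]
        // Last record (idx 8) of a 9-record S lands in block 2.
        System.out.println(bucketsFor(false, blockOf(8, 9, n), n)); // prints [0_2, 1_2, 2_2]
    }
}
```

With this scheme the reducer for key `i_j` receives exactly the records of block i of R and block j of S, at the cost of sending every record n times.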
So my questions are:
1) Is there an existing function that I can use for step 1? How can I implement this operation separately for each data set?
2) How can I refer specifically to these data sets in step 2, so that I can take one block from R and one from S?
Note: In main I define the two data sets like this:
FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
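When both paths are passed to a single job like this, the mapper needs some way to tell whether a record came from R or from S. One common approach is to tag each record according to the file it was read from; in the old `mapred` API the current file name is typically available to the mapper through the `map.input.file` job property, and Hadoop's `MultipleInputs` class can alternatively bind a separate mapper class to each path. The plain-Java sketch below shows only the tagging logic. The class `InputTagSketch`, the helper `tagFor`, and the example paths are hypothetical, and it assumes that the file name and the two input directories are given in the same normalized form.

```java
public class InputTagSketch {

    /** True if `file` is `dir` itself or lies somewhere underneath it. */
    private static boolean isUnder(String file, String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        return file.equals(dir) || file.startsWith(prefix);
    }

    /** Returns "R" or "S" depending on which input path the file belongs to. */
    public static String tagFor(String inputFile, String rDir, String sDir) {
        if (isUnder(inputFile, rDir)) {
            return "R";
        }
        if (isUnder(inputFile, sDir)) {
            return "S";
        }
        throw new IllegalArgumentException("Unknown input: " + inputFile);
    }

    public static void main(String[] args) {
        // Hypothetical paths standing in for args[0] and args[1].
        System.out.println(tagFor("/data/R/part-00000", "/data/R", "/data/S")); // prints R
        System.out.println(tagFor("/data/S/part-00003", "/data/R", "/data/S")); // prints S
    }
}
```

The tag can then decide which side of the `<id:(Sij,Ril)>` pair a record is emitted on.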
Recommended answer