Join of two datasets in Mapreduce/Hadoop

Question

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's what I exactly need to do:

I have two sets of data:

  1. Point information, which is stored as (tile_number, point_id:point_info). This is a 1:n key-value pairing: for every tile_number, there might be several point_id:point_info values.

  2. Line information, which is stored as (tile_number, line_id:line_info). This is again a 1:m key-value pairing: for every tile_number, there might be more than one line_id:line_info.

As you can see, the tile_numbers are the same between the two datasets. Now what I really need is to join these two datasets based on each tile_number. In other words, for every tile_number we have n point_id:point_info and m line_id:line_info. What I want to do is join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number.

In order to clarify, here's an example:

For the point pairs:

(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)

For the line pairs:

(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)

What I want is the following:

For tile 0:

 (tile0, point0:line0)
 (tile0, point0:line1)
 (tile0, point1:line0)
 (tile0, point1:line1)

For tile 1:

 (tile1, point1:line2)
 (tile1, point1:line3)
 (tile1, point2:line2)
 (tile1, point2:line3)

Answer

Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance, you can use a special character as a tag (even though a binary approach would be much better).

So the map output will be something like:

 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...
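
Here is a minimal sketch of such a tagging mapper using Hadoop's Java API. It assumes text input lines of the form "tile_number<TAB>id:info" and that the tag ("_" for points, "*" for lines) can be chosen from the input file path; both of these conventions are illustrative, not part of the original question.

    // Hypothetical tagging mapper: emits (tile_number, tagged record).
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Split "tile_number<TAB>id:info" into the join key and the record.
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed records
            }

            // Pick the tag from the input path (hypothetical naming convention:
            // point files live under a path containing "points").
            String path = ((FileSplit) context.getInputSplit()).getPath().toString();
            String tag = path.contains("points") ? "_" : "*";

            // e.g. (tile0, _point0) or (tile0, *line1)
            context.write(new Text(parts[0]), new Text(tag + parts[1]));
        }
    }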

Then, at the reducer, your input will have this structure:

 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]

You will have to take the values, separate the points from the lines, do a cross product, and output each pair of the cross product, like this:

tileX (lineK, pointP)
tileX (lineK, pointR)
...

If you can already easily differentiate between the point values and the line values (depending on your application specifications), you don't need the special characters (*, _).

Regarding the cross product which you have to do in the reducer: you first iterate through the entire values list and separate the values into two lists:

 List<String> points;
 List<String> lines;

Then do the cross product using two nested for loops. Finally, iterate through the resulting pairs and, for each element, output:

tile(current key), element_of_the_resulting_cross_product_list
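
A minimal sketch of the matching reducer, using the same illustrative "_"/"*" tags to split the values into the two lists; it emits each pair directly from the nested loops instead of materializing an intermediate list.

    // Hypothetical reducer: splits the tagged values into points and lines,
    // then writes the cross product for the current tile.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrossProductReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text tile, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> points = new ArrayList<>();
            List<String> lines = new ArrayList<>();

            // First pass: route each tagged value into the matching list.
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("_")) {
                    points.add(v.substring(1));
                } else if (v.startsWith("*")) {
                    lines.add(v.substring(1));
                }
            }

            // Cross product: every point paired with every line of this tile,
            // emitted as (tile, point:line), e.g. (tile0, point0:line0).
            for (String point : points) {
                for (String line : lines) {
                    context.write(tile, new Text(point + ":" + line));
                }
            }
        }
    }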
