Join of two datasets in Mapreduce/Hadoop

Question

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's what I exactly need to do:

I have two sets of data:

  1. Point information, which is stored as (tile_number, point_id:point_info). This is a 1:n key-value pairing: for every tile_number, there might be several point_id:point_info values.

  2. Line information, which is stored as (tile_number, line_id:line_info). This is again a 1:m key-value pairing: for every tile_number, there might be more than one line_id:line_info.

As you can see, the tile_numbers are the same between the two datasets. Now what I really need is to join these two datasets based on each tile_number. In other words, for every tile_number we have n point_id:point_info and m line_id:line_info. What I want to do is join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number.

In order to clarify, here's an example:

For the point pairs:

(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)

For the line pairs:

(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)

What I want is the following:

For tile 0:

 (tile0, point0:line0)
 (tile0, point0:line1)
 (tile0, point1:line0)
 (tile0, point1:line1)

For tile 1:

 (tile1, point1:line2)
 (tile1, point1:line3)
 (tile1, point2:line2)
 (tile1, point2:line3)

Answer

Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance, you can use a special character as a tag (even though a binary approach would be much better).

So the map output will be something like:

 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...
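
Here is a minimal sketch of such a tagging mapper using Hadoop's Java API. It assumes text input lines of the form "tile_number<TAB>id:info" and that the tag ("_" for points, "*" for lines) can be chosen from the input file path; both of these conventions are illustrative, not part of the original question.

    // Hypothetical tagging mapper: emits (tile_number, tagged record).
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Split "tile_number<TAB>id:info" into the join key and the record.
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed records
            }

            // Pick the tag from the input path (hypothetical naming convention:
            // point files live under a path containing "points").
            String path = ((FileSplit) context.getInputSplit()).getPath().toString();
            String tag = path.contains("points") ? "_" : "*";

            // e.g. (tile0, _point0) or (tile0, *line1)
            context.write(new Text(parts[0]), new Text(tag + parts[1]));
        }
    }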

Then, at the reducer, your input will have this structure:

 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]

You will have to take the values, separate the points from the lines, do a cross product, and output each pair of the cross product, like this:

tileX (lineK, pointP)
tileX (lineK, pointR)
...

If you can already easily differentiate between the point values and the line values (depending on your application specifications), you don't need the special characters (*, _).

Regarding the cross product which you have to do in the reducer: you first iterate through the entire values list and separate the values into two lists:

 List<String> points;
 List<String> lines;

Then do the cross product using two nested for loops. Finally, iterate through the resulting pairs and, for each element, output:

tile(current key), element_of_the_resulting_cross_product_list
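
A minimal sketch of the matching reducer, using the same illustrative "_"/"*" tags to split the values into the two lists; it emits each pair directly from the nested loops instead of materializing an intermediate list.

    // Hypothetical reducer: splits the tagged values into points and lines,
    // then writes the cross product for the current tile.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrossProductReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text tile, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> points = new ArrayList<>();
            List<String> lines = new ArrayList<>();

            // First pass: route each tagged value into the matching list.
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("_")) {
                    points.add(v.substring(1));
                } else if (v.startsWith("*")) {
                    lines.add(v.substring(1));
                }
            }

            // Cross product: every point paired with every line of this tile,
            // emitted as (tile, point:line), e.g. (tile0, point0:line0).
            for (String point : points) {
                for (String line : lines) {
                    context.write(tile, new Text(point + ":" + line));
                }
            }
        }
    }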
