Hadoop Pig: joining on any matching tuple values

Views: 108 Published: 2016/6/3 22:22:01 Tags: arrays join hadoop apache-pig

Question

I'm new to Pig and trying to use it to process a dataset. I have a set of records that looks like this:

id    elements
--------------
1     ["a","b","c"]
2     ["a","f","g"]
3     ["f","g","h"]

The idea is that I want to produce pairs of ids for records that have any overlapping elements. If elements were just a single item instead of an array, I could do a simple join like:

A = LOAD 'mydata' ...
B = FOREACH A GENERATE id as id_2, elements as elements_2;
C = JOIN A BY elements, B BY elements_2;

But since elements is an array, this won't work if there is only a partial overlap. Any thoughts on how to do this in Pig?

The intended output is the pairs of ids whose elements overlap:

(1,2)
(2,3)

Answer

I don't think it's possible to use JOIN for this. One (not so elegant) solution is to CROSS the two relations and then apply a FILTER. The FILTER condition could either be a UDF or some combination of REGEX_EXTRACT_ALL and a comparison of the extracted fields. If the size of the array is always 3, I would probably go for the REGEX_EXTRACT_ALL solution.
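
To make the CROSS-then-FILTER idea concrete, here is a minimal sketch of the logic in plain Python rather than Pig. It only illustrates what the two steps compute. The function names (has_overlap, overlapping_pairs) are hypothetical, and the id_1 < id_2 condition is an added assumption so that self-pairs and mirrored pairs such as (2,1) are dropped, matching the intended output in the question.

```python
from itertools import product

# Sample records from the question: (id, elements)
records = [
    (1, ["a", "b", "c"]),
    (2, ["a", "f", "g"]),
    (3, ["f", "g", "h"]),
]


def has_overlap(elements, elements_2):
    """Hypothetical UDF-style predicate: True if the two arrays share any item."""
    return not set(elements).isdisjoint(elements_2)


def overlapping_pairs(rows):
    # CROSS step: pair every record with every record (quadratic in the row count).
    crossed = product(rows, rows)
    # FILTER step: keep pairs that share at least one element.
    # Assumption: id_1 < id_2 removes self-pairs and mirrored duplicates.
    return [
        (id_1, id_2)
        for (id_1, elems_1), (id_2, elems_2) in crossed
        if id_1 < id_2 and has_overlap(elems_1, elems_2)
    ]


print(overlapping_pairs(records))  # [(1, 2), (2, 3)]
```

In Pig itself, the predicate would be the UDF mentioned above, applied in a FILTER over the crossed relation. Because the cross product grows with the square of the number of records, this approach is likely only practical for modest input sizes.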
