Hadoop Pig: joining on any matching tuple values
Problem description
I'm new to Pig and am trying to use it to process a dataset. I have a set of records that look like this:
id elements
--------------
1 ["a","b","c"]
2 ["a","f","g"]
3 ["f","g","h"]
The idea is to create tuples pairing the records that have any overlapping elements. If elements were just a single item instead of an array, I could do a simple join like:
A = LOAD 'mydata' ...
B = FOREACH A GENERATE id as id_2, elements as elements_2;
C = JOIN A BY elements, B BY elements_2;
But since elements is an array, this won't work if there is only a partial overlap. Any thoughts on how to do this in Pig?
The intended output would give the tuples that have overlap:
(1,2)
(2,3)
I don't think it's possible to use JOIN for this.
One (not so elegant) solution is to CROSS both relations and then apply a FILTER operation.
The FILTER condition could be either a UDF or some kind of REGEX_EXTRACT_ALL followed by a comparison of the extracted fields. If the size of the array is always 3, I would probably go for the REGEX_EXTRACT_ALL solution.
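To make the CROSS-then-FILTER idea concrete, here is a minimal sketch in plain Python of the overlap test a UDF might perform. It is not Pig code. The names `has_overlap` and `overlapping_pairs` are hypothetical, and it assumes that elements is stored as a JSON-style array string, as the sample data suggests.

```python
import json
from itertools import combinations

# Sample records from the question: (id, elements).
# Assumption: elements is a JSON-style array string.
RECORDS = [
    (1, '["a","b","c"]'),
    (2, '["a","f","g"]'),
    (3, '["f","g","h"]'),
]


def has_overlap(elements, elements_2):
    """Return True if the two array strings share at least one value."""
    return not set(json.loads(elements)).isdisjoint(json.loads(elements_2))


def overlapping_pairs(records):
    """Model CROSS followed by FILTER, testing each unordered pair once."""
    return [
        (id_1, id_2)
        for (id_1, e_1), (id_2, e_2) in combinations(records, 2)
        if has_overlap(e_1, e_2)
    ]


print(overlapping_pairs(RECORDS))  # [(1, 2), (2, 3)]
```

Note that a full CROSS of the relation with its copy also produces self-pairs such as (1,1) and mirrored pairs such as (2,1), so the FILTER would likely need an extra condition such as id < id_2 to get the output shown in the question. The sketch sidesteps this by visiting each unordered pair once.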