Apache Pig: Merging list of attributes into a single tuple
Question
I receive data in the following form:
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a bag of tuples, each consisting of an id field followed by a tuple containing all of my other fields merged together into one list.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c ...)) (id2,(attribute2b,attribute2c ...))
Currently I load it like this:
my_data = load '$input' USING PigStorage('|') as
(id:chararray, attribute1:chararray, attribute2:chararray)...
Then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to Pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.
Recommended answer
Load each line as an entire string, and then use the built-in STRSPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes (the default loader splits fields on tabs), and assumes that | and , are not to be treated any differently in separating out the different attributes. Also, I modified your input a little bit to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))
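The empty fields in that output come from consecutive delimiters in the input. To see why without running Pig, the same two-step split can be sketched in Python. This is only an illustration of the logic, not Pig code: the function name split_line is hypothetical, and Pig's STRSPLIT may treat trailing empty fields differently from Python's re.split, so the sketch is only meant to match the two sample lines above.

```python
import re

def split_line(line):
    # Step 1: split once on the first '|' to separate the id from the rest
    # (mirrors STRSPLIT(str, '\\|', 2) in test.pig).
    record_id, _, rest = line.partition("|")
    # Step 2: split the remainder on either ',' or '|'
    # (mirrors STRSPLIT(attr, '[,|]') in test.pig).
    attributes = tuple(re.split(r"[,|]", rest))
    return record_id, attributes

# The two sample lines from input.txt.
lines = [
    "id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c",
    "id2||attribute2b,attribute2c,|attribute4a|,attribute5a",
]

for line in lines:
    print(split_line(line))
```

Every pair of adjacent delimiters (for example the "|,|" in the first line) yields an empty string between them, which is where the empty slots in the tuples come from. If those are unwanted, they would have to be filtered out in a later step.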