Apache Pig: Merging list of attributes into a single tuple


Question

I receive data in the form

id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..

I'm trying to merge it all into a form where I just have a bag of tuples, each made up of an id field followed by a tuple containing all my other fields merged together.

(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...)) (id2,(attribute2b,attribute2c...))

Currently I load it like this:

my_data = load '$input' USING PigStorage('|') as 
(id:chararray, attribute1:chararray, attribute2:chararray)...

Then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to Pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.

Recommended answer

Load each line as an entire string, and then use the built-in STRSPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes, and assumes that | and , do not need to be treated any differently when separating out the attributes. Also, I modified your input a little bit to show more edge cases.

input.txt:

id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a

test.pig:

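-- Load each line whole; the default loader splits on tabs, so lines must not contain any.
my_data = LOAD '$input' AS (str:chararray);
-- Split at the first '|' only (limit 2): the id, then the rest of the line.
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
-- Split the rest on every ',' or '|'; STRSPLIT returns the pieces as a tuple.
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;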

Output of pig -x local -p input=input.txt test.pig:

(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))
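
The empty entries in the output come from adjacent delimiters: every , or | starts a new field, even when nothing sits between two of them. To see this outside Pig, here is a rough Python sketch (not Pig code) that mimics the two splits. The name split_line is made up for illustration, and the sketch assumes Python's re.split behaves like the regex split behind STRSPLIT for inputs like these.

```python
import re


def split_line(line):
    """Mimic the two STRSPLIT steps from test.pig on one input line (illustrative only)."""
    # Step 1: split at the first '|' only, similar to STRSPLIT(str, '\\|', 2).
    parts = re.split(r"\|", line, maxsplit=1)
    record_id = parts[0]
    # Assumption: a line without any '|' gets an empty remainder here;
    # Pig itself would produce a null attr field in that case.
    rest = parts[1] if len(parts) > 1 else ""
    # Step 2: split the remainder on every ',' or '|', similar to STRSPLIT(attr, '[,|]').
    # Adjacent delimiters produce empty strings, which is where the blanks come from.
    attributes = tuple(re.split(r"[,|]", rest))
    return record_id, attributes


if __name__ == "__main__":
    sample_lines = [
        "id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c",
        "id2||attribute2b,attribute2c,|attribute4a|,attribute5a",
    ]
    for sample in sample_lines:
        print(split_line(sample))
```

If the blank attributes are not wanted, they would need to be filtered out in a later step; the approach above keeps them so that field positions are not silently lost.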
