Use Pig to Denormalize a Large Data Frame
Question
I have a large-ish (21GB) tab-delimited data frame of the form
DOCID_1 TERMID_1 TITLE_1 YEAR_1 AUTHOR_1
DOCID_1 TERMID_2 TITLE_1 YEAR_1 AUTHOR_1
...
DOCID_n TERMID_n TITLE_n YEAR_n AUTHOR_n
That is, a (DOCID, TERMID) pair always uniquely identifies a row. What I need is a data frame in which a DOCID alone uniquely identifies a row, with that document's TERMIDs collapsed into a single comma-separated chararray list. For example,
DOCID_1 TERMID_11, TERMID_12, ..., TERMID_n TITLE_1 YEAR_1 AUTHOR_1
...
DOCID_n TERMID_n1, TERMID_n2, ..., TERMID_n TITLE_n YEAR_n AUTHOR_n
Can anyone think of a good way of doing this in Pig?
Recommended answer
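-- Load the semi-normalized rows. PigStorage(',') expects comma-delimited input;
-- for the tab-delimited data described in the question, use PigStorage('\t').
SEMINORMALIZED = LOAD 'so.txt' USING PigStorage(',') AS (
    doc_id:chararray
    ,term_id:chararray
    ,title:chararray
    ,year:chararray
    ,author:chararray
);
-- Keep only the (doc_id, term_id) pairs.
KEYS = FOREACH SEMINORMALIZED GENERATE
    doc_id
    ,term_id
;
-- Keep the per-document attributes, then collapse them to one row per doc_id.
ATTRIBUTES = FOREACH SEMINORMALIZED GENERATE
    doc_id
    ,title
    ,year
    ,author
;
ATTRIBUTES = DISTINCT ATTRIBUTES;
-- Collect each document's term_ids into a bag.
GROUPED = GROUP KEYS BY doc_id;
ZNF = FOREACH GROUPED GENERATE
    group AS doc_id
    ,KEYS.term_id AS term_ids
;
-- Re-attach the attributes to the collapsed term lists.
DENORMALIZED = JOIN ZNF BY doc_id, ATTRIBUTES BY doc_id;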
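The script above leaves term_ids as a bag of tuples, whereas the question asks for a comma-separated chararray. One possible way to get there is the built-in BagToString function. What follows is a minimal sketch rather than the answerer's method. It assumes a Pig release that ships BagToString (0.11 or later, as far as I know) and tab-delimited input as described in the question. The relation names ROWS, TERM_LISTS, JOINED, and RESULT and the output path 'denormalized_out' are hypothetical.

```pig
-- Sketch only: assumes tab-delimited input and a Pig version with BagToString.
ROWS = LOAD 'so.txt' USING PigStorage('\t') AS (
    doc_id:chararray, term_id:chararray, title:chararray,
    year:chararray, author:chararray
);

-- One row of attributes per document.
ATTRIBUTES = FOREACH ROWS GENERATE doc_id, title, year, author;
ATTRIBUTES = DISTINCT ATTRIBUTES;

-- Collapse each document's term_ids into a single comma-separated string.
GROUPED = GROUP ROWS BY doc_id;
TERM_LISTS = FOREACH GROUPED GENERATE
    group AS doc_id,
    BagToString(ROWS.term_id, ',') AS term_ids;

-- Join back to the attributes and drop the duplicated join key.
JOINED = JOIN TERM_LISTS BY doc_id, ATTRIBUTES BY doc_id;
RESULT = FOREACH JOINED GENERATE
    TERM_LISTS::doc_id AS doc_id,
    TERM_LISTS::term_ids AS term_ids,
    ATTRIBUTES::title AS title,
    ATTRIBUTES::year AS year,
    ATTRIBUTES::author AS author;

-- 'denormalized_out' is a hypothetical output directory.
STORE RESULT INTO 'denormalized_out' USING PigStorage('\t');
```

Note that a bag is unordered, so the order of term IDs within each string is not guaranteed. If a document ever appears with more than one distinct (title, year, author) combination, the join will emit one output row per combination.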