Splitting a tuple into multiple tuples in Pig
Question
I would like to generate multiple tuples from a single tuple. Specifically, I have a file containing the following data:
>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2
I load it with the following commands:
grunt >> A = load '$data' using PigStorage('|');
grunt >> dump A;
(ID,ColumnName1:Value1,ColumnName2:Value2)
Now I want to split this tuple into two tuples:
(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)
Can I use a UDF together with foreach ... generate? Something like the following?
grunt >> foreach A generate SOMEUDF(A)
Input tuple: (id1,column1,column2). Output: two tuples, (id1,column1) and (id1,column2). Should the UDF return a List, or should it return a Bag?
public class SPLITTUPPLE extends EvalFunc<List<Tuple>>
{
    public List<Tuple> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // Not sure whether I can create tuples on my own. Looks like I should use TupleFactory.
            // Return a list of tuples.
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Is this approach correct?
Recommended answer
You could write a UDF, or use a Pig script with built-in functions.
For example:
-- each line must be loaded as a chararray; PigStorage('|') returns bytearray fields, which will not work for this example
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
-- split on | and flatten the result into a row so its fields can be dereferenced later
splt = foreach inpt generate FLATTEN(STRSPLIT($0, '\\|')) ;
-- the first column is the id; all fields (including the id) are converted into a bag and flattened into rows
id_vals = foreach splt generate $0 as id, FLATTEN(TOBAG(*)) as value;
-- this also yields (id, id) records, but an id should not contain ':', so they are filtered out below
id_vals = foreach id_vals generate id, INDEXOF(value, ':') as p, STRSPLIT(value, ':', 2) as vals;
final = foreach (filter id_vals by p != -1) generate id, FLATTEN(vals) as (col, val);
dump final;
Test input:
1|c1:11:33|c2:12
234|c1:21|c2:22
33|c1:31|c2:32
345|c1:41|c2:42
Output:
(1,c1,11:33)
(1,c2,12)
(234,c1,21)
(234,c2,22)
(33,c1,31)
(33,c2,32)
(345,c1,41)
(345,c2,42)
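If you take the UDF route instead, the per-row logic is the same as in the script: split the line on |, keep the first field as the id, and split each remaining field on its first ':' only. The following is a minimal plain-Java sketch of that logic, not a Pig UDF. TupleSplitter and splitRow are hypothetical names, and trimming whitespace around fields is an assumption based on the spaced sample row in the question. In an actual UDF, each triple would presumably be built with TupleFactory and collected into a DataBag returned from an EvalFunc<DataBag>, so that FLATTEN in a foreach turns the bag into separate rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the per-row splitting logic.
// TupleSplitter and splitRow are hypothetical names, not part of Pig.
final class TupleSplitter {

    // Turns "ID|name1:value1|name2:value2" into one (id, name, value) triple per column.
    static List<List<String>> splitRow(String line) {
        List<List<String>> rows = new ArrayList<>();
        if (line == null || line.isEmpty()) {
            return rows;
        }
        // '|' is a regex metacharacter, so it has to be escaped.
        String[] fields = line.split("\\|");
        if (fields.length == 0) {
            return rows;
        }
        // Assumption: surrounding whitespace is not significant.
        String id = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            // Limit 2 splits on the first ':' only, so "c1:11:33" gives ("c1", "11:33").
            String[] pair = fields[i].trim().split(":", 2);
            if (pair.length == 2) {
                rows.add(Arrays.asList(id, pair[0], pair[1]));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints one triple per line for the first test row.
        for (List<String> row : splitRow("1|c1:11:33|c2:12")) {
            System.out.println(row);
        }
    }
}
```

Fields without a ':' are skipped, which plays the same role as the p != -1 filter in the script.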
Hope this helps.
Cheers.