Splitting a tuple into multiple tuples in Pig

Views: 120 Posted: 2018/5/31 18:41:40 Tags: hadoop apache-pig

Question

I would like to generate multiple tuples from a single tuple. Specifically, I have a file with the following data in it.

>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2

So I load it with the following command:

grunt >> A = load '$data' using PigStorage('|');    
grunt >> dump A;    
(ID,ColumnName1:Value1,ColumnName2:Value2)

Now I want to split this tuple into two tuples:

(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)

Can I use a UDF together with foreach ... generate? Something like the following?

grunt >> foreach A generate SOMEUDF(A)

Edit:

Input tuple: (id1,column1,column2). Output: two tuples, (id1,column1) and (id1,column2). Should the UDF return a List, or should it return a Bag?

public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
{
    public List<Tuple> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try{
            // Not sure whether I can create tuples on my own; it looks like I should use TupleFactory.
            // return list of tuples.
        }catch(Exception e){
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

Is this approach correct?

Recommended answer

You could write a UDF or use a Pig script with built-in functions.

For example:

-- each line must be loaded as a chararray; PigStorage('|') returns bytearray fields, which will not work for this example
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);

-- split by | and create a row so we can dereference it later
splt = foreach inpt generate FLATTEN(STRSPLIT($0, '\\|')) ;

-- the first column is the id; the whole row is converted into a bag and flattened into one record per field
id_vals = foreach splt generate $0 as id, FLATTEN(TOBAG(*)) as value;
-- this also produces an (id, id) record; the id should not contain ':', so the filter below drops it
id_vals = foreach id_vals generate id, INDEXOF(value, ':') as p, STRSPLIT(value, ':', 2) as vals;
final = foreach (filter id_vals by p != -1) generate id, FLATTEN(vals) as (col, val);
dump final;

Test input:

1|c1:11:33|c2:12
234|c1:21|c2:22
33|c1:31|c2:32
345|c1:41|c2:42

Output:

(1,c1,11:33)
(1,c2,12)
(234,c1,21)
(234,c2,22)
(33,c1,31)
(33,c2,32)
(345,c1,41)
(345,c2,42)
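
The UDF route is mentioned above but not shown. As a rough sketch of the row-splitting logic such a UDF would need, the plain-Java class below turns one id|name:value|... line into (id, name, value) rows. The class name SplitRecord and the method split are hypothetical and do not come from the original post. The sketch deliberately avoids Pig classes so that it runs on its own; in an actual UDF one would presumably extend EvalFunc<DataBag> rather than EvalFunc<List<Tuple>>, build each row with TupleFactory, collect the rows in a DataBag, and FLATTEN the result in the script. Trimming whitespace around fields is an assumption based on the spaced sample line in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Pig): the core splitting logic only.
class SplitRecord {
    // Splits "id|name1:value1|name2:value2" into rows of {id, name, value}.
    public static List<String[]> split(String line) {
        List<String[]> rows = new ArrayList<>();
        if (line == null || line.isEmpty()) {
            return rows;
        }
        String[] fields = line.split("\\|");
        if (fields.length == 0) {
            return rows; // the line contained only delimiters
        }
        String id = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            // Split on the first ':' only, so "c1:11:33" keeps "11:33" as its value.
            int p = field.indexOf(':');
            if (p == -1) {
                continue; // not a name:value pair, skip it
            }
            rows.add(new String[] {
                id, field.substring(0, p).trim(), field.substring(p + 1).trim()
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints (1,c1,11:33) and (1,c2,12), matching the first two output rows above.
        for (String[] row : split("1|c1:11:33|c2:12")) {
            System.out.println("(" + String.join(",", row) + ")");
        }
    }
}
```

Splitting on the first ':' mirrors STRSPLIT(value, ':', 2) in the script, and skipping fields without a ':' plays the role of the p != -1 filter.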

I hope it helps.

Cheers.
