Pig Latin: converting a bag to a tuple after GROUP BY


Problem description

I have data with the following schema: (t0: chararray, t1: int, t2: int)

(B,4,2)
(A,2,3)
(A,3,2)
(B,2,2)
(A,1,2)
(B,1,2)

I'd like to generate the following results (grouped by t0 and ordered by t1):

(A, ((1,2),(2,3),(3,2)))
(B, ((1,2),(2,2),(4,2)))

Please note that I want only tuples in the second component, not bags. Please help.

Recommended answer

You should be able to do it like this.

-- A: (t0: chararray, t1: int, t2: int)

B = GROUP A BY t0;
C = FOREACH B {
    -- Drop t0 from the grouped rows, keeping only t1 and t2.
    projected = FOREACH A GENERATE t1, t2;
    -- Now you can order the projection.
    ordered = ORDER projected BY t1;
    -- vals is a bag sorted by t1, not a tuple.
    GENERATE group AS t0, ordered AS vals;
}

You can read more about nested FOREACH blocks in the Apache Pig documentation.

NOTE/UPDATE: When I originally answered this question, I missed that the asker wanted the output in tuple form. Tuples should only be used when you know the exact number and position of the fields in the tuple. Otherwise the schema will be undefined and the fields will be very difficult to access, because the entire tuple will be treated as a bytearray and you will have to find and cast everything manually.

If you must do it this way, you cannot do it in pure Pig. You will have to use some sort of UDF, and I would recommend Python.
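
As a rough illustration of the UDF route, here is a minimal sketch in Python. It assumes the Jython UDF convention in which a Pig bag arrives as a list of tuples. The function name bag_to_tuple and the choice to sort on the first field inside the UDF are illustrative assumptions, not part of the original answer. No output schema is declared, so Pig would be expected to treat the result as a bytearray, which is exactly the drawback described above.

```python
# Hypothetical UDF sketch: the name bag_to_tuple is made up for illustration.
# Assumption: in a Jython UDF, a Pig bag is passed in as a list of tuples.

def bag_to_tuple(bag):
    """Turn a bag of (t1, t2) rows into a single tuple of tuples, ordered by t1."""
    if bag is None:
        # Propagate nulls instead of failing on an empty group.
        return None
    # Sort on the first field (t1) so the result does not depend on
    # the order in which Pig hands the rows to the UDF.
    ordered = sorted(bag, key=lambda row: row[0])
    # Build an immutable tuple whose elements are the row tuples.
    return tuple(tuple(row) for row in ordered)


if __name__ == "__main__":
    # The rows of group A from the question, with t0 already projected away.
    print(bag_to_tuple([(2, 3), (3, 2), (1, 2)]))
```

In a Pig script, a file containing this function would typically be registered with something like REGISTER 'bag_to_tuple.py' USING jython AS udfs; and then called as udfs.bag_to_tuple(projected) inside the FOREACH. The file name and the udfs alias are assumptions here.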
