Pig Latin: bag to tuple after GROUP BY

Problem description

I have the following data with schema (t0: chararray, t1: int, t2: int):

(B,4,2)
(A,2,3)
(A,3,2)
(B,2,2)
(A,1,2)
(B,1,2)

I'd like to generate the following result (grouped by t0 and ordered by t1):

(A, ((1,2),(2,3),(3,2)))
(B, ((1,2),(2,2),(4,2)))

Please note that I want only tuples in the second component, not bags. Please help.

Recommended answer

You should be able to do it like this:

-- A: (t0: chararray, t1: int, t2: int)

B = GROUP A BY t0;
C = FOREACH B {
    -- Drop t0 (it is already available as the group key); keep t1 and t2.
    projected = FOREACH A GENERATE t1, t2;
    -- Now you can order the projection.
    ordered = ORDER projected BY t1;
    GENERATE group AS t0, ordered AS vals;
}

You can read more about nested FOREACH blocks in the Pig Latin documentation.

NOTE/UPDATE: When I originally answered this question, I missed the part where the asker wanted the output in tuple form. Tuples should only be used when you know the exact number and position of the fields in the tuple. Otherwise your schema will not be defined, and it will be very difficult to access the fields. This is because the entire tuple will be treated as a bytearray, so you will have to find and cast everything manually.

If you must do it this way, you cannot do it in pure Pig. You'll have to use some sort of UDF. I would recommend Python.
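As a rough idea of what such a UDF could look like, here is a minimal sketch. The function name bag_to_tuple is made up for illustration, and it assumes Pig hands the bag to the Python UDF as a list of tuples. No output schema is declared, so Pig would treat the result as a bytearray, which is exactly the limitation described in the note above.

```python
# Hypothetical UDF sketch: bag_to_tuple is an illustrative name, not part of Pig.
# Assumption: the bag arrives as a Python list of (t1, t2) tuples.

def bag_to_tuple(bag):
    """Turn a bag of (t1, t2) rows into one tuple of tuples, ordered by t1."""
    if bag is None:
        return None
    # Sort by the first field (t1) so the result does not depend on bag order.
    ordered = sorted(bag, key=lambda row: row[0])
    # A tuple of tuples, since the asker wants tuples rather than a bag.
    return tuple(ordered)


if __name__ == "__main__":
    # The rows for group A from the question, in their original order.
    print(bag_to_tuple([(2, 3), (3, 2), (1, 2)]))
```

If the function were saved in a file (say bag_to_tuple.py, again a made-up name), it could be registered with something like REGISTER 'bag_to_tuple.py' USING jython AS myfuncs; and then called as myfuncs.bag_to_tuple(vals) on the relation produced above. Each group will still yield a tuple of a different length, so the fields inside have to be located and cast by hand.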
