将行值合并到列中 [英] Aggregate row value into columns

查看:116
本文介绍了将行值合并到列中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这样的数据:

  2013-11 localhost kern 
2013-11 localhost kern
2013-11 192.168.0.59守护进程
2013-12 localhost kern
2013-12 localhost守护进程
2013-12 localhost邮件

你明白了。我试图按日期(作为行键)对上述进行分组,并有一列对应于每个 kern 守护进程的计数等等。总之,我想要的输出应该如下所示:

   -  date,count( (2013-12,1,1,1)
(2013-11,2,1,0)




目前,我的方法就是这样。

  valid_records = FILTER formatted_records BY日期不为空; 

date_group = GROUP valid_records BY日期;
facilities = FOREACH date_group {
- DUMB现在过滤的方式:(
kern = FILTER valid_records BY facility =='kern';
user = FILTER valid_records BY facility = ='user';
daemon = FILTER valid_records BY facility =='daemon';

- 需要这样才能正确映射到HBase
GENERATE组,COUNT(kern),COUNT(用户),COUNT(守护进程);
}

两个问题:


  1. 我在上面有3个过滤器,但是在生产中,应该有多于10个过滤器。如果我使用了很多 FILTER 像上面那样有任何性能问题?


  2. 其他更好方式做到这一点?

  3. 寻找一个带有浮动模式的输出,但是你似乎只需要用一个复合键进行分组:
    包含这个脚本:

      formatted_records = LOAD'input'AS(日期:chararray,主机:chararray,facility:chararray); 
    valid_records = FILTER formatted_records BY日期不为空;
    counts = FOREACH(GROUP valid_records BY(date,facility))GENERATE
    group.date AS日期,
    group.facility AS facility,
    COUNT(valid_records)AS the_count;
    DUMP计数;

    您将得到:

     (2013-11,kern,2)
    (2013-11,守护程序,1)
    (2013-12,kern,1)
    (2013-12 ,邮件,1)
    (2013-12,daemon,1)

    相同的信息。

    如果您想以类似于您的奇特方式格式化输出,那么最好使用通用语言(如Java或Python)这些任务是分开的(假设Pig的输出足够小以适应内存)。猪不擅长这一点。


    I have data like this:

    2013-11    localhost       kern
    2013-11    localhost       kern
    2013-11    192.168.0.59    daemon
    2013-12    localhost       kern
    2013-12    localhost       daemon
    2013-12    localhost       mail
    

    You get the idea. I'm trying to group the above by date (as the row key) and have a column which correspond to the count of each kern, daemon, etc. In short, my desired output should be as below:

    -- date, count(kern), count(daemon), count(mail)
    (2013-11, 2, 1, 0)
    (2013-12, 1, 1, 1)
    

    Currently, my approach is like this.

    valid_records = FILTER formatted_records BY date is not null;
    
    date_group = GROUP valid_records BY date;
    facilities = FOREACH date_group {
        -- DUMB way to filter for now :(
        kern = FILTER valid_records BY facility == 'kern';
        user = FILTER valid_records BY facility == 'user';
        daemon = FILTER valid_records BY facility == 'daemon';
    
        -- this need to be in order so it get mapped correctly to HBase
        GENERATE group, COUNT(kern), COUNT(user), COUNT(daemon);
    }
    

    Two questions:

    1. I have 3 filters in above, but in production, there should be more than that maybe 10 filters. Is there any performance hit if I'm using a lot of FILTER like above?

    2. Any other better way to do this?

    解决方案

    I think your problem is that you are looking for an output with a floating kind of schema. But it seems that all you have to do is to group by a composite key: with this script:

    formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
    valid_records = FILTER formatted_records BY date is not null;
    counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
            group.date AS date,
            group.facility AS facility,
            COUNT(valid_records) AS the_count;
    DUMP counts;
    

    You will get:

    (2013-11,kern,2)
    (2013-11,daemon,1)
    (2013-12,kern,1)
    (2013-12,mail,1)
    (2013-12,daemon,1)
    

    which gives just the same information.

    If you want to format the output in a fancy way like yours then it is better to use a general-purpose language (like Java or Python) for such tasks separately (assuming that Pig's output is small enough to fit in memory). Pig is not good at this.

    这篇关于将行值合并到列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆