将行值聚合到列中 [英] Aggregate row value into columns

查看:30
本文介绍了将行值聚合到列中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这样的数据:

2013-11    localhost       kern
2013-11    localhost       kern
2013-11    192.168.0.59    daemon
2013-12    localhost       kern
2013-12    localhost       daemon
2013-12    localhost       mail

你懂的.我正在尝试按日期(作为行键)对上述内容进行分组,并有一列对应于每个 kerndaemon 等的计数.简而言之,我想要的输出应该如下:

You get the idea. I'm trying to group the above by date (as the row key) and have a column which correspond to the count of each kern, daemon, etc. In short, my desired output should be as below:

-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)

目前,我的做法是这样的.

Currently, my approach is like this.

valid_records = FILTER formatted_records BY date is not null;

date_group = GROUP valid_records BY date;
facilities = FOREACH date_group {
    -- DUMB way to filter for now :(
    kern = FILTER valid_records BY facility == 'kern';
    user = FILTER valid_records BY facility == 'user';
    daemon = FILTER valid_records BY facility == 'daemon';

    -- this need to be in order so it get mapped correctly to HBase
    GENERATE group, COUNT(kern), COUNT(user), COUNT(daemon);
}

两个问题:

  1. 我在上面有 3 个过滤器,但在生产中,应该有 10 个以上的过滤器.如果我像上面那样使用大量 FILTER 会不会影响性能?

还有其他更好的方法吗?

Any other better way to do this?

推荐答案

我认为您的问题是您正在寻找具有浮动类型架构的输出.但似乎您所要做的就是按复合键分组:使用此脚本:

I think your problem is that you are looking for an output with a floating kind of schema. But it seems that all you have to do is to group by a composite key: with this script:

formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
valid_records = FILTER formatted_records BY date is not null;
counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
        group.date AS date,
        group.facility AS facility,
        COUNT(valid_records) AS the_count;
DUMP counts;

您将获得:

(2013-11,kern,2)
(2013-11,daemon,1)
(2013-12,kern,1)
(2013-12,mail,1)
(2013-12,daemon,1)

提供相同的信息.

如果你想像你一样以一种奇特的方式格式化输出,那么最好分别使用通用语言(如 Java 或 Python)来执行此类任务(假设 Pig 的输出足够小以适合内存).猪不擅长这个.

If you want to format the output in a fancy way like yours then it is better to use a general-purpose language (like Java or Python) for such tasks separately (assuming that Pig's output is small enough to fit in memory). Pig is not good at this.

这篇关于将行值聚合到列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆