将行值合并到列中 [英] Aggregate row value into columns
问题描述
我有这样的数据:
2013-11 localhost kern
2013-11 localhost kern
2013-11 192.168.0.59守护进程
2013-12 localhost kern
2013-12 localhost守护进程
2013-12 localhost邮件
你明白了。我试图按日期(作为行键)对上述进行分组,并有一列对应于每个 kern
,守护进程的计数
等等。总之,我想要的输出应该如下所示:
- date,count( (2013-12,1,1,1)
$ c(count)(守护程序),count(邮件)
(2013-11,2,1,0)
目前,我的方法就是这样。 valid_records = FILTER formatted_records BY日期不为空;
date_group = GROUP valid_records BY日期;
facilities = FOREACH date_group {
- DUMB现在过滤的方式:(
kern = FILTER valid_records BY facility =='kern';
user = FILTER valid_records BY facility = ='user';
daemon = FILTER valid_records BY facility =='daemon';
- 需要这样才能正确映射到HBase
GENERATE组,COUNT(kern),COUNT(用户),COUNT(守护进程);
}
两个问题:
-
我在上面有3个过滤器,但是在生产中,应该有多于10个过滤器。如果我使用了很多 FILTER
像上面那样有任何性能问题?
-
其他更好方式做到这一点?
寻找一个带有浮动模式的输出,但是你似乎只需要用一个复合键进行分组:
包含这个脚本: formatted_records = LOAD'input'AS(日期:chararray,主机:chararray,facility:chararray);
valid_records = FILTER formatted_records BY日期不为空;
counts = FOREACH(GROUP valid_records BY(date,facility))GENERATE
group.date AS日期,
group.facility AS facility,
COUNT(valid_records)AS the_count;
DUMP计数;
您将得到:
(2013-11,kern,2)
(2013-11,守护程序,1)
(2013-12,kern,1)
(2013-12 ,邮件,1)
(2013-12,daemon,1)
相同的信息。
如果您想以类似于您的奇特方式格式化输出,那么最好使用通用语言(如Java或Python)这些任务是分开的(假设Pig的输出足够小以适应内存)。猪不擅长这一点。
I have data like this:
2013-11 localhost kern
2013-11 localhost kern
2013-11 192.168.0.59 daemon
2013-12 localhost kern
2013-12 localhost daemon
2013-12 localhost mail
You get the idea. I'm trying to group the above by date (as the row key) and have a column which correspond to the count of each kern
, daemon
, etc. In short, my desired output should be as below:
-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)
Currently, my approach is like this.
valid_records = FILTER formatted_records BY date is not null;
date_group = GROUP valid_records BY date;
facilities = FOREACH date_group {
-- DUMB way to filter for now :(
kern = FILTER valid_records BY facility == 'kern';
user = FILTER valid_records BY facility == 'user';
daemon = FILTER valid_records BY facility == 'daemon';
-- this need to be in order so it get mapped correctly to HBase
GENERATE group, COUNT(kern), COUNT(user), COUNT(daemon);
}
Two questions:
I have 3 filters in above, but in production, there should be more than that maybe 10 filters. Is there any performance hit if I'm using a lot of FILTER
like above?
Any other better way to do this?
解决方案 I think your problem is that you are looking for an output with a floating kind of schema. But it seems that all you have to do is to group by a composite key:
with this script:
formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
valid_records = FILTER formatted_records BY date is not null;
counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
group.date AS date,
group.facility AS facility,
COUNT(valid_records) AS the_count;
DUMP counts;
You will get:
(2013-11,kern,2)
(2013-11,daemon,1)
(2013-12,kern,1)
(2013-12,mail,1)
(2013-12,daemon,1)
which gives just the same information.
If you want to format the output in a fancy way like yours then it is better to use a general-purpose language (like Java or Python) for such tasks separately (assuming that Pig's output is small enough to fit in memory). Pig is not good at this.
这篇关于将行值合并到列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!