将行值合并到列中 [英] Aggregate row value into columns

查看：116 发布时间：2018/5/31 19:56:56 hadoop apache-pig

本文介绍了将行值合并到列中的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有这样的数据：

  2013-11 localhost kern 
 2013-11 localhost kern 
 2013-11 192.168.0.59守护进程
 2013-12 localhost kern 
 2013-12 localhost守护进程
 2013-12 localhost邮件

你明白了。我试图按日期（作为行键）对上述进行分组，并有一列对应于每个 kern ，守护进程的计数等等。总之，我想要的输出应该如下所示：

   -  date，count（ （2013-12,1,1,1）
 （2013-11,2,1,0）
 
 
 
 
目前，我的方法就是这样。
  valid_records = FILTER formatted_records BY日期不为空; 
 
 date_group = GROUP valid_records BY日期; 
 facilities = FOREACH date_group {
  -  DUMB现在过滤的方式:( 
 kern = FILTER valid_records BY facility =='kern'; 
 user = FILTER valid_records BY facility = ='user'; 
 daemon = FILTER valid_records BY facility =='daemon'; 
 
  - 需要这样才能正确映射到HBase 
 GENERATE组，COUNT（kern），COUNT（用户），COUNT（守护进程）; 
} 
  
两个问题：
 
 
 
  我在上面有3个过滤器，但是在生产中，应该有多于10个过滤器。如果我使用了很多 FILTER 像上面那样有任何性能问题？
  
  其他更好方式做到这一点？
         寻找一个带有浮动模式的输出，但是你似乎只需要用一个复合键进行分组：
包含这个脚本：
  formatted_records = LOAD'input'AS（日期：chararray，主机：chararray，facility：chararray）; 
 valid_records = FILTER formatted_records BY日期不为空; 
 counts = FOREACH（GROUP valid_records BY（date，facility））GENERATE 
 group.date AS日期，
 group.facility AS facility，
 COUNT（valid_records）AS the_count; 
 DUMP计数; 
  
您将得到： 
 
 
 （2013-11，kern，2）
（2013-11，守护程序，1）
（2013-12，kern，1）
（2013-12 ，邮件，1）
（2013-12，daemon，1）
  
相同的信息。 
 
 
如果您想以类似于您的奇特方式格式化输出，那么最好使用通用语言（如Java或Python）这些任务是分开的（假设Pig的输出足够小以适应内存）。猪不擅长这一点。
 
I have data like this:
2013-11    localhost       kern
2013-11    localhost       kern
2013-11    192.168.0.59    daemon
2013-12    localhost       kern
2013-12    localhost       daemon
2013-12    localhost       mail
You get the idea. I'm trying to group the above by date (as the row key) and have a column which correspond to the count of each kern, daemon, etc. In short, my desired output should be as below:
-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)
Currently, my approach is like this.
valid_records = FILTER formatted_records BY date is not null;

date_group = GROUP valid_records BY date;
facilities = FOREACH date_group {
    -- DUMB way to filter for now :(
    kern = FILTER valid_records BY facility == 'kern';
    user = FILTER valid_records BY facility == 'user';
    daemon = FILTER valid_records BY facility == 'daemon';

    -- this need to be in order so it get mapped correctly to HBase
    GENERATE group, COUNT(kern), COUNT(user), COUNT(daemon);
}
Two questions:

I have 3 filters in above, but in production, there should be more than that maybe 10 filters. Is there any performance hit if I'm using a lot of FILTER like above?
Any other better way to do this?

 解决方案 
I think your problem is that you are looking for an output with a floating kind of schema. But it seems that all you have to do is to group by a composite key:
with this script:
formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
valid_records = FILTER formatted_records BY date is not null;
counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
        group.date AS date,
        group.facility AS facility,
        COUNT(valid_records) AS the_count;
DUMP counts;
You will get:
(2013-11,kern,2)
(2013-11,daemon,1)
(2013-12,kern,1)
(2013-12,mail,1)
(2013-12,daemon,1)
which gives just the same information.

If you want to format the output in a fancy way like yours then it is better to use a general-purpose language (like Java or Python) for such tasks separately (assuming that Pig's output is small enough to fit in memory). Pig is not good at this.

                        这篇关于将行值合并到列中的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

将行值合并到列中 [英] Aggregate row value into columns

问题描述

相关文章

分布式计算/Hadoop最新文章

热门教程

热门工具

登录关闭

将行值合并到列中 [英] Aggregate row value into columns

问题描述

相关文章

分布式计算/Hadoop最新文章

热门教程

热门工具

登录 关闭

登录关闭