finding mean using pig or hadoop

Question

I have a huge text file of the following form. The data is saved in a directory as data/data1.txt, data2.txt and so on:

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on..

What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output in a file, something like:

merchant_id, average_amount
1234, avg_amt_1234
and so on.

How do I calculate the standard deviation?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

Answer

Apache PIG is well suited to such tasks. See the example:

-- load the data; amnt is declared double so the double implementation of SUM is used
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- group the rows by id, then compute sum, count and mean per group
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};
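
If you then want the merchant_id, average_amount file from the question, a minimal follow-up is to keep just those two columns and STORE them; this is a sketch, and the output path and comma delimiter are my assumptions:

-- keep only the id and the mean, then write them out comma-separated (assumed output path)
result = foreach mean generate id, mean;
store result into '~/pig_data/pig_fun/output/merchant_avg' using PigStorage(',');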

Pay special attention to the data type of the amnt column, as it determines which implementation of the SUM function PIG is going to invoke.
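
For example, if the file were loaded without a schema, the amounts would arrive as bytearrays; an explicit cast keeps the double version of SUM in play. This is a sketch, and the comma delimiter passed to PigStorage is an assumption:

-- load without a schema, then cast the fields we care about (assumed comma-delimited input)
raw = load '~/pig_data/pig_fun/input/group.txt' using PigStorage(',');
typed = foreach raw generate (double)$0 as amnt, (chararray)$1 as id;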

PIG can also do something that SQL cannot: it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using standard deviation.

-- same grouping, but FLATTEN(inpt) keeps every original row alongside the group aggregates
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: now you have access to the original amount that contributed to the group's average, sum and count.
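
To inspect what the flattened relation looks like, describe or dump it; the schema in the comment is roughly what I would expect with the load statement above (the field prefixes may differ by PIG version):

describe mean;
-- roughly: mean: {inpt::amnt: double, inpt::id: chararray, inpt::c2: chararray, mean: double, sum: double, count: long}
dump mean;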

UPDATE 1:

Calculating the variance and standard deviation:

-- same load and per-group mean as above, keeping every row via flatten
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- squared difference of each amount from its group's mean
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- regroup by id and sum the squared differences
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- variance = sqr_sum / count; standard deviation = SQRT(variance)
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

This will use 2 jobs. I have not figured out how to do it in one; I need to spend more time on it.
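
For the z-scores mentioned earlier, something along these lines should work off the standard relation; this is a sketch, z_score is a name I made up, and you may need to disambiguate the flattened field names with the :: prefix:

-- per-row z-score: (amount - group mean) / group standard deviation (aliases from the UPDATE above)
z = foreach standard generate id, amnt, (amnt - avg) / standard as z_score;
store z into '~/pig_data/pig_fun/output/z_scores' using PigStorage(',');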
