Finding mean using Pig or Hadoop


Problem description



I have a huge text file of the form below.

The data is saved in a directory as data/data1.txt, data2.txt, and so on:

    merchant_id, user_id, amount
    1234, 9123, 299.2
    1233, 9199, 203.2
    1234, 0124, 230
    and so on..

What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output to a file, something like:

    merchant_id, average_amount
    1234, avg_amt_1234
    and so on.

How do I calculate the standard deviation as well?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

Solution

Apache PIG is well suited to such tasks. See the example:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};

Pay special attention to the data type of the amnt column, as it influences which implementation of the SUM function PIG is going to invoke.
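To see what the grouping script above computes, here is a minimal Python cross-check of the same per-group SUM/COUNT/mean logic, using the sample rows from the question (the row values and dictionary names are illustrative, not part of the Pig API):

```python
from collections import defaultdict

# Sample rows in the question's format: merchant_id, user_id, amount.
rows = [
    ("1234", "9123", 299.2),
    ("1233", "9199", 203.2),
    ("1234", "0124", 230.0),
]

# Equivalent of GROUP BY merchant_id followed by SUM and COUNT per group.
sums = defaultdict(float)
counts = defaultdict(int)
for merchant_id, _user_id, amount in rows:
    sums[merchant_id] += amount
    counts[merchant_id] += 1

# mean = sum/count per group, as in the Pig foreach block.
means = {m: sums[m] / counts[m] for m in sums}
print(means)
```

Merchant 1234 has two rows (299.2 and 230), so its mean comes out as (299.2 + 230) / 2 = 264.6.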

PIG can also do something that SQL cannot: it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using the standard deviation.

mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick; now you have access to the original amount that contributed to the group's average, sum, and count.

UPDATE 1:

Calculating variance and standard deviation:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

It will use 2 jobs. I have not figured out how to do it in one yet; need to spend more time on it.
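As a quick sanity check on the formulas above (population variance sqr_sum/count, and standard deviation as its square root), here is a minimal Python equivalent over a toy data set (the sample values are assumptions for illustration):

```python
import math
from collections import defaultdict

# Toy rows as (id, amnt).
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]

# Collect the amounts per group.
groups = defaultdict(list)
for rid, amnt in rows:
    groups[rid].append(amnt)

stats = {}
for rid, amnts in groups.items():
    avg = sum(amnts) / len(amnts)
    # Same formula as the Pig script: population variance, i.e. the mean
    # of squared differences (amnt - avg)^2, then its square root.
    variance = sum((a - avg) ** 2 for a in amnts) / len(amnts)
    stats[rid] = (variance, math.sqrt(variance))

print(stats)  # {'a': (1.0, 1.0), 'b': (0.0, 0.0)}
```

For group "a" the mean is 2.0 and both squared differences are 1.0, giving variance 1.0 and standard deviation 1.0; a single-row group has variance 0.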
