通过合并两个数据集来查找平均值 [英] Find average by joining two datasets

查看:103
本文介绍了通过合并两个数据集来查找平均值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据集,

EmployeeDetail(data set 1):- 
   id  
   name
   gender
   location 

SalaryDetail(data set 2):-
   id
   salary

我需要加入两者,并找出每个位置的男性和女性平均工资.所以我尝试了以下代码.

I need to join both and find out average salary of male and female in each location. So I tried following code .

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as 
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as 
(id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, 
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

但是它抛出了错误

<line 6, column 22>  Syntax error, unexpected symbol at or near 
'JoinedEmpDetail'

任何人都可以帮助我在哪里做错或如何正确处理吗?

Can anyone please help where am I doing the mistake or how to do it properly?

为了更加清楚我的要求,我提供了一些示例数据集.

For more clarity about my requirement I am giving some sample data sets.

EmpDetail.txt

EmpDetail.txt

1   Biswa   Male    Bangalore
12  Bratati Mahapatra   Female  Chennai
2   Bibhu kalyan    Male    Bangalore
3   Chinta  Male    Mumbai
10  Amrit Anand Male    Bangalore
11  Sateesh panda   Male    Bangalore
4   Kirti Kumar Male    Mumbai
6   Shruthi Female  Chennai
7   Vijay   Male    Chennai
5   Bibhu   Male    Chennai
9   Bratati  Mohanty    Female  Bangalore
8   Rupa Mahapatra  Female  Bangalore
13  Salini  Female  Mumbai
14  Priyanka Chopra Female  Mumbai

EmpSalary.txt

EmpSalary.txt

1   10000
12  12000
2   15900
3   9000
10  8000
11  13400
4   7600
6   22000
7   17000
5   16800
9   9800
8   10000
13  11000
14  12500

我需要的最终结果是:

Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>

推荐答案

您可以使用简单的foreach stmt解决此问题,因此不要使用嵌套的foreach stmt.

You can solve this problem using simple foreach stmt so don't go for nested foreach stmt.

Group command在嵌套的Foreach(在Pig中受限制)内不起作用.嵌套的foreach中只有很少的命令(CROSS,DISTINCT,FILTER,FOREACH,LIMIT和ORDER BY).

Group command will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).

您可以这样更改脚本吗?

Can you change your script like this?

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;

输出:

(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)

这篇关于通过合并两个数据集来查找平均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆