通过连接两个数据集求平均值 [英] Find average by joining two datasets

查看:27
本文介绍了通过连接两个数据集求平均值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据集,

EmployeeDetail(data set 1):- 
   id  
   name
   gender
   location 

SalaryDetail(data set 2):-
   id
   salary

我需要同时计算每个地点的男性和女性的平均工资.所以我尝试了以下代码.

I need to join both and find out average salary of male and female in each location. So I tried following code .

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as 
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as 
(id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, 
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

但它抛出以下错误

<line 6, column 22>  Syntax error, unexpected symbol at or near 
'JoinedEmpDetail'

任何人都可以请帮助我在哪里做错或如何正确地做吗?

Can anyone please help where am I doing the mistake or how to do it properly?

为了更清楚地说明我的要求,我提供了一些示例数据集.

For more clarity about my requirement I am giving some sample data sets.

EmpDetail.txt

1   Biswa   Male    Bangalore
12  Bratati Mahapatra   Female  Chennai
2   Bibhu kalyan    Male    Bangalore
3   Chinta  Male    Mumbai
10  Amrit Anand Male    Bangalore
11  Sateesh panda   Male    Bangalore
4   Kirti Kumar Male    Mumbai
6   Shruthi Female  Chennai
7   Vijay   Male    Chennai
5   Bibhu   Male    Chennai
9   Bratati  Mohanty    Female  Bangalore
8   Rupa Mahapatra  Female  Bangalore
13  Salini  Female  Mumbai
14  Priyanka Chopra Female  Mumbai

EmpSalary.txt

1   10000
12  12000
2   15900
3   9000
10  8000
11  13400
4   7600
6   22000
7   17000
5   16800
9   9800
8   10000
13  11000
14  12500

我需要的最终结果是:

Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>

推荐答案

您可以使用简单的 foreach stmt 解决这个问题,所以不要使用嵌套的 foreach stmt.

You can solve this problem using simple foreach stmt so don't go for nested foreach stmt.

Group command 在嵌套的 Foreach 中不起作用,它在 pig 中受到限制.嵌套的 foreach 中只允许使用少数命令(CROSS、DISTINCT、FILTER、FOREACH、LIMIT 和 ORDER BY).

Group command will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).

你能像这样改变你的脚本吗?

Can you change your script like this?

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;

输出:

(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)

这篇关于通过连接两个数据集求平均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆