通过连接两个数据集求平均值 [英] Find average by joining two datasets
问题描述
我有两个数据集,
EmployeeDetail(data set 1):-
id
name
gender
location
SalaryDetail(data set 2):-
id
salary
我需要同时计算每个地点的男性和女性的平均工资.所以我尝试了以下代码.
I need to join both and find out average salary of male and female in each location. So I tried following code .
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as
(id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation {
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group,
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
但它抛出以下错误
<line 6, column 22> Syntax error, unexpected symbol at or near
'JoinedEmpDetail'
任何人都可以请帮助我在哪里做错或如何正确地做吗?
Can anyone please help where am I doing the mistake or how to do it properly?
为了更清楚地说明我的要求,我提供了一些示例数据集.
For more clarity about my requirement I am giving some sample data sets.
EmpDetail.txt
1 Biswa Male Bangalore
12 Bratati Mahapatra Female Chennai
2 Bibhu kalyan Male Bangalore
3 Chinta Male Mumbai
10 Amrit Anand Male Bangalore
11 Sateesh panda Male Bangalore
4 Kirti Kumar Male Mumbai
6 Shruthi Female Chennai
7 Vijay Male Chennai
5 Bibhu Male Chennai
9 Bratati Mohanty Female Bangalore
8 Rupa Mahapatra Female Bangalore
13 Salini Female Mumbai
14 Priyanka Chopra Female Mumbai
EmpSalary.txt
1 10000
12 12000
2 15900
3 9000
10 8000
11 13400
4 7600
6 22000
7 17000
5 16800
9 9800
8 10000
13 11000
14 12500
我需要的最终结果是:
Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
推荐答案
您可以使用简单的 foreach stmt
解决这个问题,所以不要使用嵌套的 foreach stmt.
You can solve this problem using simple foreach stmt
so don't go for nested foreach stmt.
Group command
在嵌套的 Foreach 中不起作用,它在 pig 中受到限制.嵌套的 foreach 中只允许使用少数命令(CROSS、DISTINCT、FILTER、FOREACH、LIMIT 和 ORDER BY).
Group command
will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
你能像这样改变你的脚本吗?
Can you change your script like this?
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;
输出:
(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)
这篇关于通过连接两个数据集求平均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!