通过合并两个数据集来查找平均值 [英] Find average by joining two datasets
问题描述
我有两个数据集,
EmployeeDetail(data set 1):-
id
name
gender
location
SalaryDetail(data set 2):-
id
salary
我需要加入两者,并找出每个位置的男性和女性平均工资.所以我尝试了以下代码.
I need to join both and find out average salary of male and female in each location. So I tried following code .
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as
(id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation {
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group,
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
但是它抛出了错误
<line 6, column 22> Syntax error, unexpected symbol at or near
'JoinedEmpDetail'
任何人都可以帮助我在哪里做错或如何正确处理吗?
Can anyone please help where am I doing the mistake or how to do it properly?
为了更加清楚我的要求,我提供了一些示例数据集.
For more clarity about my requirement I am giving some sample data sets.
EmpDetail.txt
EmpDetail.txt
1 Biswa Male Bangalore
12 Bratati Mahapatra Female Chennai
2 Bibhu kalyan Male Bangalore
3 Chinta Male Mumbai
10 Amrit Anand Male Bangalore
11 Sateesh panda Male Bangalore
4 Kirti Kumar Male Mumbai
6 Shruthi Female Chennai
7 Vijay Male Chennai
5 Bibhu Male Chennai
9 Bratati Mohanty Female Bangalore
8 Rupa Mahapatra Female Bangalore
13 Salini Female Mumbai
14 Priyanka Chopra Female Mumbai
EmpSalary.txt
EmpSalary.txt
1 10000
12 12000
2 15900
3 9000
10 8000
11 13400
4 7600
6 22000
7 17000
5 16800
9 9800
8 10000
13 11000
14 12500
我需要的最终结果是:
Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
推荐答案
您可以使用简单的foreach stmt
解决此问题,因此不要使用嵌套的foreach stmt.
You can solve this problem using simple foreach stmt
so don't go for nested foreach stmt.
Group command
在嵌套的Foreach(在Pig中受限制)内不起作用.嵌套的foreach中只有很少的命令(CROSS,DISTINCT,FILTER,FOREACH,LIMIT和ORDER BY).
Group command
will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
您可以这样更改脚本吗?
Can you change your script like this?
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;
输出:
(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)
这篇关于通过合并两个数据集来查找平均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!