Calculate average using Spark Scala
Problem description
How do I calculate the average salary per location in Spark Scala with the two data sets below?
File1.csv (column 4 is the salary):
Ram, 30, Engineer, 40000
Bala, 27, Doctor, 30000
Hari, 33, Engineer, 50000
Siva, 35, Doctor, 60000
File2.csv (column 2 is the location):
Hari, Bangalore
Ram, Chennai
Bala, Bangalore
Siva, Chennai
The above files are not sorted. I need to join these two files and find the average salary per location. I tried the code below but could not make it work.
val salary = sc.textFile("File1.csv").map(e => e.split(","))
val location = sc.textFile("File2.csv").map(e.split(","))
val joined = salary.map(e=>(e(0),e(3))).join(location.map(e=>(e(0),e(1)))
val joinedData = joined.sortByKey()
val finalData = joinedData.map(v => (v._1,v._2._1._1,v._2._2))
val aggregatedDF = finalData.map(e=> e.groupby(e(2)).agg(avg(e(1))))
aggregatedDF.repartition(1).saveAsTextFile("output.txt")
Please help with the code, and show what the sample output will look like.
Thanks a lot.
Recommended answer
I would use the DataFrame API. Your attempt cannot work as written: groupby and agg are DataFrame operations, not methods you can call on a tuple inside an RDD map, the second textFile line is missing its e => lambda, and the join line has unbalanced parentheses. This should work:
import org.apache.spark.sql.functions._
import spark.implicits._  // for .toDF; `spark` is the SparkSession (predefined in spark-shell)

// File1.csv: keep (name, salary), trimming the space after each comma
val salary = sc.textFile("File1.csv")
  .map(_.split(","))
  .map { case Array(name, _, _, sal) => (name.trim, sal.trim.toDouble) }
  .toDF("name", "salary")

// File2.csv: keep (name, location)
val location = sc.textFile("File2.csv")
  .map(_.split(","))
  .map { case Array(name, loc) => (name.trim, loc.trim) }
  .toDF("name", "location")

// Join on name, then average the salary per location
salary
  .join(location, Seq("name"))
  .groupBy($"location")
  .agg(avg($"salary").as("avg_salary"))
  .repartition(1)  // produce a single output file
  .write.csv("output.csv")
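
With the sample data above, Bangalore averages (30000 + 50000) / 2 = 40000 and Chennai averages (40000 + 60000) / 2 = 50000, so the part file written under output.csv should contain rows like these (row order is not guaranteed):

Bangalore,40000.0
Chennai,50000.0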
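
If you would rather stay with the RDD API your attempt started from, here is a minimal sketch under the same assumptions (same file names and column layout) that computes the per-location average with reduceByKey instead of DataFrames:

val salaryRdd = sc.textFile("File1.csv")
  .map(_.split(","))
  .map(a => (a(0).trim, a(3).trim.toDouble))   // (name, salary)

val locationRdd = sc.textFile("File2.csv")
  .map(_.split(","))
  .map(a => (a(0).trim, a(1).trim))            // (name, location)

val avgByLocation = salaryRdd
  .join(locationRdd)                           // (name, (salary, location))
  .map { case (_, (sal, loc)) => (loc, (sal, 1)) }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum / n }      // sum / count = average

avgByLocation.collect()
// expected: Array((Bangalore,40000.0), (Chennai,50000.0))

Keeping a running (sum, count) pair with reduceByKey avoids pulling all salaries for a location onto one executor the way groupByKey would.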