Calculate average using Spark Scala


Problem description

How do I calculate the average salary per location in Spark Scala with the two data sets below?

File1.csv (Column 4 is salary)

Ram, 30, Engineer, 40000  
Bala, 27, Doctor, 30000  
Hari, 33, Engineer, 50000  
Siva, 35, Doctor, 60000

File2.csv (Column 2 is location)

Hari, Bangalore  
Ram, Chennai  
Bala, Bangalore  
Siva, Chennai  

The above files are not sorted. I need to join these two files and find the average salary per location. I tried the code below but could not make it work.

val salary = sc.textFile("File1.csv").map(e => e.split(","))  
val location = sc.textFile("File2.csv").map(e.split(","))  
val joined = salary.map(e=>(e(0),e(3))).join(location.map(e=>(e(0),e(1)))  
val joinedData = joined.sortByKey()  
val finalData = joinedData.map(v => (v._1,v._2._1._1,v._2._2))  
val aggregatedDF = finalData.map(e=> e.groupby(e(2)).agg(avg(e(1))))    
aggregatedDF.repartition(1).saveAsTextFile("output.txt")  

Please help with the code and show what the sample output would look like.

Many thanks.

Answer

I would use the DataFrame API; this should work:

val salary = sc.textFile("File1.csv")
               .map(e => e.split(","))
               .map{case Seq(name,_,_,salary) => (name,salary)}
               .toDF("name","salary")

val location = sc.textFile("File2.csv")
                 .map(e => e.split(","))
                 .map{case Seq(name,location) => (name,location)}
                 .toDF("name","location")

import org.apache.spark.sql.functions._

salary
  .join(location,Seq("name"))
  .groupBy($"location")
  .agg(
    avg($"salary").as("avg_salary")
  )
  .repartition(1)
  .write.csv("output.csv")

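If you would rather stay with the RDD API that the question attempts, a minimal sketch along the same lines (assuming the same comma-separated layout; the names salaryRdd, locationRdd and avgByLocation are just illustrative) could look like this:

// (name, salary) and (name, location) pair RDDs, trimming the fields
val salaryRdd = sc.textFile("File1.csv")
                  .map(line => line.split(",").map(_.trim))
                  .map(f => (f(0), f(3).toDouble))

val locationRdd = sc.textFile("File2.csv")
                    .map(line => line.split(",").map(_.trim))
                    .map(f => (f(0), f(1)))

// join on name, then average per location as sum / count
val avgByLocation = salaryRdd.join(locationRdd)                // (name, (salary, location))
  .map { case (_, (sal, loc)) => (loc, (sal, 1L)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }

avgByLocation.collect().foreach(println)   // e.g. (Bangalore,40000.0), (Chennai,50000.0)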
