Calculate average using Spark Scala


Problem description

How do I calculate the average salary per location in Spark Scala with the two data sets below?

File1.csv (Column 4 is salary)

Ram, 30, Engineer, 40000  
Bala, 27, Doctor, 30000  
Hari, 33, Engineer, 50000  
Siva, 35, Doctor, 60000

File2.csv (Column 2 is location)

Hari, Bangalore  
Ram, Chennai  
Bala, Bangalore  
Siva, Chennai  

The above files are not sorted. I need to join these two files and find the average salary per location. I tried the code below but could not make it work.

val salary = sc.textFile("File1.csv").map(e => e.split(","))  
val location = sc.textFile("File2.csv").map(e.split(","))  
val joined = salary.map(e=>(e(0),e(3))).join(location.map(e=>(e(0),e(1)))  
val joinedData = joined.sortByKey()  
val finalData = joinedData.map(v => (v._1,v._2._1._1,v._2._2))  
val aggregatedDF = finalData.map(e=> e.groupby(e(2)).agg(avg(e(1))))    
aggregatedDF.repartition(1).saveAsTextFile("output.txt")  

Please help with the code and show what the sample output would look like.

Many thanks.

Answer

I would use the DataFrame API; this should work:

val salary = sc.textFile("File1.csv")
               .map(e => e.split(","))
               .map{case Seq(name,_,_,salary) => (name,salary)}
               .toDF("name","salary")

val location = sc.textFile("File2.csv")
                 .map(e => e.split(","))
                 .map{case Seq(name,location) => (name,location)}
                 .toDF("name","location")

import org.apache.spark.sql.functions._

salary
  .join(location,Seq("name"))
  .groupBy($"location")
  .agg(
    avg($"salary").as("avg_salary")
  )
  .repartition(1)
  .write.csv("output.csv")

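If you would rather stay with the RDD API that the question attempts, a minimal sketch along the same lines (assuming the same comma-separated layout; the names salaryRdd, locationRdd and avgByLocation are just illustrative) could look like this:

// (name, salary) and (name, location) pair RDDs, trimming the fields
val salaryRdd = sc.textFile("File1.csv")
                  .map(line => line.split(",").map(_.trim))
                  .map(f => (f(0), f(3).toDouble))

val locationRdd = sc.textFile("File2.csv")
                    .map(line => line.split(",").map(_.trim))
                    .map(f => (f(0), f(1)))

// join on name, then average per location as sum / count
val avgByLocation = salaryRdd.join(locationRdd)                // (name, (salary, location))
  .map { case (_, (sal, loc)) => (loc, (sal, 1L)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }

avgByLocation.collect().foreach(println)   // e.g. (Bangalore,40000.0), (Chennai,50000.0)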
