Spark - Scala - Replacing value in dataframe with lookup value from another data frame
Question
I'm working with Spark on Databricks. The programming language is Scala.
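I have two data frames:
I would like to:
- find all rows in the main data frame where "Age" == -1
- look at the "Title" value of that row
- look up in data frame 2 the average age of people with that title
- update the age in the main data frame with this value.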
I've racked my brain over how to do this. The only thing I came up with was storing the data frame as a table in Databricks and using SQL statements (sql.Context.Sql...), which ended up being very complicated.
I'm wondering if there's a more efficient way of doing this.
Adding a reproducible example
import org.apache.spark.sql.functions._
// Databricks notebooks already provide `sc` and the implicits that toDF needs;
// elsewhere, add `import spark.implicits._` first.
val df = sc.parallelize(Seq(
  ("Fred", 20, "Intern"), ("Linda", -1, "Manager"), ("Sean", 23, "Junior Employee"),
  ("Walter", 35, "Manager"), ("Kate", -1, "Junior Employee"), ("Kathrin", 37, "Manager"),
  ("Bob", 16, "Intern"), ("Lukas", 24, "Junior Employee")
)).toDF("Name", "Age", "Title")
println("Data Frame DF")
df.show()
// Average age per title, ignoring the -1 placeholders
val avgAge = df.filter("Age != -1").groupBy("Title").agg(avg("Age").alias("avg_age"))
println("Average Ages")
avgAge.show()
// Rows whose age is missing
println("Missing Age")
val noAge = df.filter("Age == -1")
noAge.show()
Solution, thanks to Karol Sudol
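// Rename avg_age to Age so the result keeps the same column names as df
val imputedAges = df.filter("Age == -1").join(avgAge, Seq("Title")).select(col("Name"), col("avg_age").alias("Age"), col("Title"))
imputedAges.show()
// union matches columns by position, so both sides are (Name, Age, Title)
val finalDF = imputedAges.union(df.filter("Age != -1"))
println("FinalDF")
finalDF.show()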
Recommended answer
val df = dfMain.filter("age == -1").join(dfLookUp, Seq("title")).select(col("title"), col("avg"), ......)
If you want to retain any other values, use a left/right/outer join with the main DF in the next step.
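The left-join suggestion can be sketched as a single pass that keeps every row and only overwrites the placeholder ages. This is a minimal sketch, not the answerer's exact code: the names `people`, `avgByTitle`, and `imputed` are hypothetical, the sample data is modeled on the question's example, and the explicit local `SparkSession` is an assumption for running outside a notebook (on Databricks, `getOrCreate()` should simply reuse the existing session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, when}

// Assumption: a local session for standalone runs; a notebook already has one.
val spark = SparkSession.builder().appName("impute-age").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data modeled on the question; -1 marks a missing age.
val people = Seq(
  ("Fred", 20, "Intern"), ("Linda", -1, "Manager"), ("Sean", 23, "Junior Employee"),
  ("Walter", 35, "Manager"), ("Kate", -1, "Junior Employee"), ("Kathrin", 37, "Manager"),
  ("Bob", 16, "Intern"), ("Lukas", 24, "Junior Employee")
).toDF("Name", "Age", "Title")

// Lookup table: average of the known ages per title.
val avgByTitle = people
  .filter(col("Age") =!= -1)
  .groupBy("Title")
  .agg(avg("Age").alias("avg_age"))

// The left join keeps every row of `people`. Where Age is -1 it is replaced by
// the title's average; if a title has no known ages, avg_age (and so Age) is null.
val imputed = people
  .join(avgByTitle, Seq("Title"), "left")
  .withColumn("Age", when(col("Age") === -1, col("avg_age")).otherwise(col("Age")))
  .select("Name", "Age", "Title")

imputed.show()
```

Because the replacement happens inside one `withColumn`, there is no need to split the data frame and union the halves back together. Note that `Age` comes out as a double, since the averages are fractional; cast it back if an integer column is required.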
Browse the tutorials: Databricks training