How to perform a union on two DataFrames with different numbers of columns in Spark?
Problem description
I have 2 DataFrames:
I need a union like this:
The unionAll function doesn't work because the number and the names of the columns are different.
How can I do this?
Recommended answer
In Scala, you just have to append all missing columns as nulls.
import org.apache.spark.sql.functions._

// df1 and df2 are the DataFrames to merge
// (toDF needs the SQL implicits in scope; spark-shell imports them automatically)
val df1 = sc.parallelize(List(
  (50, 2),
  (34, 4)
)).toDF("age", "children")

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union of the column names

// Keep the columns a DataFrame already has and fill the missing ones with nulls
def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}

// On Spark 2.0 and later, union behaves the same as unionAll
df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
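For comparison, here is a minimal sketch of the same merge using the built-in unionByName with allowMissingColumns = true. It assumes Spark 3.1 or later, where that parameter is available, and a local SparkSession. The object name UnionByNameSketch and the helper unionWithMissing are hypothetical names chosen for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Hypothetical helper: matches columns by name and fills columns
  // that are missing on either side with nulls (requires Spark 3.1+).
  def unionWithMissing(left: DataFrame, right: DataFrame): DataFrame =
    left.unionByName(right, allowMissingColumns = true)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((50, 2), (34, 4)).toDF("age", "children")
    val df2 = Seq((26, true, 60000.00), (32, false, 35000.00))
      .toDF("age", "education", "income")

    unionWithMissing(df1, df2).show()

    spark.stop()
  }
}
```

On older Spark versions this parameter does not exist, so the manual approach of adding null columns remains the way to go there.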
Update
Both temporary DataFrames will have the same column order, because we are mapping through total in both cases.
df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()
+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50|       2|     null|  null|
| 34|       4|     null|  null|
+---+--------+---------+------+

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
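The ordering argument can be illustrated without Spark. The sketch below is plain Scala and uses a hypothetical helper named plan that only describes, as strings, which columns would be selected and which would be filled with nulls. It also sorts the combined column names, which is an assumption added here to make the order explicit rather than dependent on Set iteration order.

```scala
object ColumnOrderSketch {
  // Hypothetical helper: for every name in allCols, either keep the column
  // (if this DataFrame has it) or describe a null placeholder with that name.
  def plan(myCols: Set[String], allCols: Seq[String]): Seq[String] =
    allCols.map(c => if (myCols.contains(c)) c else s"null AS $c")

  def main(args: Array[String]): Unit = {
    val cols1 = Set("age", "children")
    val cols2 = Set("age", "education", "income")

    // Both sides iterate over the same sequence, so they share one column order.
    val total = (cols1 ++ cols2).toSeq.sorted

    println(plan(cols1, total).mkString(", "))
    // prints: age, children, null AS education, null AS income
    println(plan(cols2, total).mkString(", "))
    // prints: age, null AS children, education, income
  }
}
```

Because both calls walk the same total sequence, position i refers to the same column name on both sides, which is what a positional union requires.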