How to perform union on two DataFrames with different amounts of columns in spark?


Question

I have two DataFrames:

I need a union like this:

The unionAll function doesn't work, because the number and the names of the columns are different.

How can I do this?

Answer

In Scala you just have to append all the missing columns as nulls.

import org.apache.spark.sql.functions._

// let df1 and df2 be the DataFrames to merge
val df1 = sc.parallelize(List(
  (50, 2),
  (34, 4)
)).toDF("age", "children")

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union of the two column sets

// For each column name in allCols, keep the column if this DataFrame has it,
// otherwise add it as a null literal aliased to that name.
// (Note: this local expr shadows org.apache.spark.sql.functions.expr, which is not used here.)
def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}

// unionAll is the Spark 1.x name; on Spark 2.x+ the equivalent call is union.
df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
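
If you need to do this for more than one pair of DataFrames, the same idea can be wrapped in a small helper. The following is only a sketch of the approach above, not part of the original answer: the function name unionWithDifferentColumns is made up, and it uses union, which replaced unionAll in Spark 2.x.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// A minimal sketch: pad each DataFrame with null columns for whatever
// the other one has, then union them. The helper name is hypothetical.
def unionWithDifferentColumns(a: DataFrame, b: DataFrame): DataFrame = {
  val colsA = a.columns.toSet
  val colsB = b.columns.toSet
  val all   = (colsA ++ colsB).toList

  def padded(df: DataFrame, own: Set[String]) =
    df.select(all.map(c => if (own.contains(c)) col(c) else lit(null).as(c)): _*)

  // On Spark 1.x, replace union with unionAll.
  padded(a, colsA).union(padded(b, colsB))
}

// Usage with the df1 and df2 defined above:
// unionWithDifferentColumns(df1, df2).show()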


更新

两个时态 DataFrames 将具有相同的列顺序,因为我们在这两种情况下都通过 total 进行映射.


Update

Both temporal DataFrames will have the same order of columns, because we are mapping through total in both cases.

df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()

+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50|       2|     null|  null|
| 34|       4|     null|  null|
+---+--------+---------+------+

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
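
As a side note: if you are on Spark 3.1 or later (an assumption about your environment, not something the question states), the built-in unionByName can do this padding for you.

// Spark 3.1+: missing columns are filled with nulls automatically.
val merged = df1.unionByName(df2, allowMissingColumns = true)
merged.show()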
