What is going wrong with `unionAll` of Spark `DataFrame`?


Question

Using Spark 1.5.0 and given the following code, I expect `unionAll` to union DataFrames based on their column names. In the code, I'm using some FunSuite for passing in the SparkContext `sc`:

object Entities {

  case class A (a: Int, b: Int)
  case class B (b: Int, a: Int)

  val as = Seq(
    A(1,3),
    A(2,4)
  )

  val bs = Seq(
    B(5,3),
    B(6,4)
  )
}

class UnsortedTestSuite extends SparkFunSuite {

  configuredUnitTest("The truth test.") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val aDF = sc.parallelize(Entities.as, 4).toDF
    val bDF = sc.parallelize(Entities.bs, 4).toDF
    aDF.show()
    bDF.show()
    aDF.unionAll(bDF).show
  }
}

Output:

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
+---+---+

+---+---+
|  b|  a|
+---+---+
|  5|  3|
|  6|  4|
+---+---+

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  5|  3|
|  6|  4|
+---+---+

Why does the result contain intermixed "b" and "a" columns, instead of aligning the columns based on their names? Sounds like a serious bug!?

Answer

It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same way. You'll find SQL Fiddle examples linked with the names.

Quoting the PostgreSQL manual:

In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types

Column names, excluding the first table in the set operation, are simply ignored.

This behavior comes directly from relational algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicate handling) to the output you get here.
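
Given that the union is purely positional, the simplest manual fix for the example above is to reorder one side before the union. A minimal sketch, assuming the `aDF`/`bDF` defined in the question:

// Reorder bDF's columns to match aDF's column order, then union positionally
aDF.unionAll(bDF.select("a", "b")).show()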

If you want to match columns by name, you can do something like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  // Keep only the columns present in both DataFrames and select them
  // in the same order on both sides before the positional unionAll.
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
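
For example, applied to the `aDF` and `bDF` from the question (a usage sketch, not part of the original answer):

// Columns are now matched by name, so B's values land in the right columns:
// B(5, 3) becomes (a = 3, b = 5) and B(6, 4) becomes (a = 4, b = 6).
unionByName(aDF, bDF).show()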

To check both names and types, it should be enough to replace `columns` with:

a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
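
Put together, a hypothetical variant (call it `unionByNameAndType`; the name is not from the original answer) that intersects on (name, type) pairs could look like this sketch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: keep only columns whose name AND type match in both DataFrames
def unionByNameAndType(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.dtypes.toSet.intersect(b.dtypes.toSet).map { case (c, _) => col(c) }.toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}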

