What is going wrong with `unionAll` of Spark `DataFrame`?


Question

Using Spark 1.5.0 and given the following code, I expect `unionAll` to union DataFrames based on their column names. In the code, I'm using a FunSuite helper to pass in the SparkContext `sc`:

object Entities {

  case class A (a: Int, b: Int)
  case class B (b: Int, a: Int)

  val as = Seq(
    A(1,3),
    A(2,4)
  )

  val bs = Seq(
    B(5,3),
    B(6,4)
  )
}

class UnsortedTestSuite extends SparkFunSuite {

  configuredUnitTest("The truth test.") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val aDF = sc.parallelize(Entities.as, 4).toDF
    val bDF = sc.parallelize(Entities.bs, 4).toDF
    aDF.show()
    bDF.show()
    aDF.unionAll(bDF).show
  }
}

Why does the result

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
+---+---+

+---+---+
|  b|  a|
+---+---+
|  5|  3|
|  6|  4|
+---+---+

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  5|  3|
|  6|  4|
+---+---+

contain intermixed "b" and "a" columns, instead of aligning columns based on column names? Sounds like a serious bug!?

Recommended answer

It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same (SQL Fiddle example for MS SQL: http://sqlfiddle.com/#!6/c4649/1).

To quote the PostgreSQL manual:

In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types.

Column names, except those of the first table in the set operation, are simply ignored.

This behavior comes directly from relational algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicate handling) to the output you get here.
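To make the tuple argument concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. The object name `PositionalUnion` is made up for illustration; the case classes and values mirror the ones in the question. It contrasts the positional union that `unionAll` performs with the name-based alignment the question expected.

```scala
object PositionalUnion {
  case class A(a: Int, b: Int)
  case class B(b: Int, a: Int)

  val as = Seq(A(1, 3), A(2, 4))
  val bs = Seq(B(5, 3), B(6, 4))

  // Dropping the field names leaves ordered pairs in declaration order:
  // (a, b) for A, but (b, a) for B.
  val aTuples: Seq[(Int, Int)] = as.map(x => (x.a, x.b))
  val bTuples: Seq[(Int, Int)] = bs.map(x => (x.b, x.a))

  // Positional union: concatenate the tuples and keep the first side's names.
  val positional: Seq[(Int, Int)] = aTuples ++ bTuples

  // Name-based alignment: reorder B's fields to (a, b) before concatenating.
  val byName: Seq[(Int, Int)] = aTuples ++ bs.map(x => (x.a, x.b))
}
```

Here `positional` holds the same four rows as the last table in the question, while `byName` swaps the two values in each of the rows that came from `B`.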

If you want to match using names, you can do something like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
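One property of the helper above is that it goes through a `Set`, so the order of the output columns is not guaranteed to follow either input, and columns present on only one side are dropped silently. A possible variant, sketched under the assumption that both frames are expected to carry exactly the same column names, keeps the column order of `a` and fails early otherwise. The names `UnionHelpers` and `unionAligned` are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object UnionHelpers {
  // Hypothetical helper: reorder b's columns to match a's order,
  // then let unionAll do its usual positional union.
  def unionAligned(a: DataFrame, b: DataFrame): DataFrame = {
    // Assumption: a mismatch in column names is an error, not something to drop quietly.
    require(
      a.columns.toSet == b.columns.toSet,
      s"Column names differ: ${a.columns.mkString(",")} vs ${b.columns.mkString(",")}"
    )
    a.unionAll(b.select(a.columns.map(col): _*))
  }
}
```

With the `aDF` and `bDF` from the question, `UnionHelpers.unionAligned(aDF, bDF).show()` should list the rows from `B` as (3, 5) and (4, 6) under the columns `a`, `b`. On Spark 2.3 and later, `Dataset.unionByName` is built in and covers this case directly.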

To check both names and types, it should be enough to replace columns with:

a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
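Because the intersection keeps only the (name, type) pairs that appear on both sides, a column whose type differs between the two frames simply disappears from the result. The sketch below is plain Scala that works on the `Array[(String, String)]` returned by `DataFrame.dtypes`, and reports which columns would be kept and which would be dropped. The names `SchemaDiff` and `split` are hypothetical, chosen only for this illustration.

```scala
object SchemaDiff {
  // Takes the dtypes arrays of both frames and returns
  // (column names kept by the intersection, column names that would be dropped).
  def split(
      a: Array[(String, String)],
      b: Array[(String, String)]
  ): (Seq[String], Seq[String]) = {
    val shared = a.toSet.intersect(b.toSet)
    // Keep the order of the first frame for the surviving columns.
    val kept = a.filter(shared.contains).map(_._1).toSeq
    // Anything on either side that is not in the shared set would be dropped.
    val dropped = (a ++ b).filterNot(shared.contains).map(_._1).distinct.toSeq
    (kept, dropped)
  }
}
```

Calling `SchemaDiff.split(a.dtypes, b.dtypes)` before the union gives a cheap way to log or reject a schema mismatch instead of losing columns unnoticed.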
