Spark unionAll multiple dataframes


Question

Given a set of DataFrames

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

I want to union all of them:

df1.unionAll(df2).unionAll(df3)

Is there a more elegant and scalable way of doing this for any number of DataFrames, for example from

Seq(df1, df2, df3) 

Answer

The simplest solution is to reduce with union (unionAll in Spark < 2.0):

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)
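One caveat worth noting (not in the original answer): reduce throws on an empty sequence, so if the list of DataFrames might be empty, reduceOption is a safer variant. A minimal sketch, assuming a running Spark application and DataFrames with identical schemas; the helper name unionAllSafe is hypothetical:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: union a possibly-empty Seq of DataFrames.
// Returns None for an empty input instead of throwing an
// UnsupportedOperationException, which plain reduce would do.
def unionAllSafe(dfs: Seq[DataFrame]): Option[DataFrame] =
  dfs.reduceOption(_ union _)

// Usage:
// val merged: Option[DataFrame] = unionAllSafe(Seq(df1, df2, df3))
```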

This is relatively concise and shouldn't move data out of off-heap storage, but it extends the lineage with each union, so plan analysis takes non-linear time. This can become a problem if you try to merge a large number of DataFrames.

You can also convert to RDDs and use SparkContext.union:

dfs match {
  case Nil      => None            // empty input: nothing to union
  case h :: Nil => Some(h)         // a single DataFrame: return it as-is
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     // one flat union over all underlying RDDs,
                     // rebuilt as a DataFrame with the first frame's schema
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
}
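Wrapped as a named helper, the match expression above can be reused directly. A sketch, assuming the hypothetical function name unionViaRDD and DataFrames with identical schemas:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper around the match expression from the answer.
// A single SparkContext.union keeps the logical plan flat, avoiding the
// deep lineage that chained DataFrame unions would build up.
def unionViaRDD(dfs: List[DataFrame]): Option[DataFrame] = dfs match {
  case Nil      => None
  case h :: Nil => Some(h)
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
}

// Usage:
// val merged: Option[DataFrame] = unionViaRDD(List(df1, df2, df3))
```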

It keeps the lineage short and the analysis cost low, but it is otherwise less efficient than merging the DataFrames directly.

