Spark unionAll multiple dataframes
Problem description
For a set of dataframes
val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")
to union all of them I do
df1.unionAll(df2).unionAll(df3)
Is there a more elegant and scalable way of doing this for any number of dataframes, for example from
Seq(df1, df2, df3)
Recommended answer
The simplest solution is to reduce with union (unionAll in Spark < 2.0):
val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)
This is relatively concise and shouldn't move data out of off-heap storage, but it extends the lineage with each union, and plan analysis then takes non-linear time. That can become a problem if you try to merge a large number of DataFrames.
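How reduce folds the sequence can be sketched with plain Scala Lists standing in for DataFrames (no Spark required); the left-to-right pairwise folding is what makes each union wrap the previous result and deepen the plan:

```scala
// Plain-Scala stand-in for dfs.reduce(_ union _): reduce folds the
// sequence pairwise, left to right, so each step wraps the previous
// result -- the same shape the DataFrame lineage takes.
val parts: Seq[List[(Int, Int)]] = Seq(
  (1 to 4).map(i => (i, i * 10)).toList,   // stand-in for df1
  (1 to 4).map(i => (i, i * 100)).toList,  // stand-in for df2
  (1 to 4).map(i => (i, i * 1000)).toList  // stand-in for df3
)
val all = parts.reduce(_ ++ _)  // analogous to _ union _
println(all.length)             // 12 rows, as a chained union would give
```

With real DataFrames, each such step adds another Union node on top of the previous plan, which is why the analysis cost grows with the number of inputs.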
You can also convert to RDDs and use SparkContext.union:
dfs match {
  case h :: Nil => Some(h)
  case h :: _ => Some(h.sqlContext.createDataFrame(
    h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
    h.schema
  ))
  case Nil => None
}
It keeps the lineage short and the analysis cost low, but it is otherwise less efficient than merging DataFrames directly.
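The three-way match above (empty input, single input, many inputs) can be sketched with plain Lists, again without Spark; unionAllOpt is a hypothetical helper name used only for illustration:

```scala
// Plain-Scala sketch of the pattern match above: None for an empty
// input, the single element returned unchanged, otherwise one flat
// "union" of everything (as SparkContext.union does for RDDs).
def unionAllOpt[A](xs: List[List[A]]): Option[List[A]] = xs match {
  case h :: Nil => Some(h)           // exactly one input: return as-is
  case _ :: _   => Some(xs.flatten)  // two or more: union in one step
  case Nil      => None              // nothing to union
}

println(unionAllOpt(List(List(1, 2), List(3), List(4))))  // Some(List(1, 2, 3, 4))
println(unionAllOpt[Int](Nil))                            // None
```

Note that the original snippet relies on the :: extractor, which matches List; Seq(df1, df2, df3) builds a List by default, so the match works as written.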