Scala - Spark: How to union all DataFrames in a loop

Problem description

Is there a way to build a single DataFrame by unioning the DataFrames created in a loop?

Here is the sample code:

var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
) 

for (x <- fruits){
  // df is re-created on every iteration, so only the last row survives
  var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}

I would like to get something like this:

aCol | bCol | fruitsName
-----+------+-----------
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon

Thanks again.

Recommended answer

Steffen Schmitz's answer is the most concise one, I believe. Below is a more detailed answer if you are looking for more customization (of field types, etc.):
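
For reference, a concise approach in that spirit (a sketch only, not necessarily the exact code from that answer; it assumes spark.implicits._ is in scope, as it is in spark-shell) maps each element to a single-row DataFrame and folds the list with union:

import spark.implicits._ // provided automatically in spark-shell

val fruits = List("apple", "orange", "melon")

// build one single-row DataFrame per element, then fold them together
val result = fruits
  .map(x => Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name"))
  .reduce(_ union _)

result.show()

This avoids the mutable variable and the empty starting DataFrame, at the cost of creating one small DataFrame per element.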

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import spark.implicits._ // needed for toDF outside of spark-shell

//initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

//list to iterate through
var fruits = List(
    "apple"
    ,"orange"
    ,"melon"
)

for (x <- fruits) {
  // union returns a new Dataset; it matches columns by position,
  // so the unnamed columns (_1, _2, _3) line up with aCol, bCol, name
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}

//initialDF.show()
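
Uncommenting initialDF.show() should print something like:

+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+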

References:

  • How to create an empty DataFrame with a specified schema?
  • https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
  • https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
