Spark: How can a DataFrame be a Dataset[Row] if DataFrames have a schema?

Question

This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.

Take the example in the blog post of converting an RDD to a DataFrame: if a DataFrame were the same thing as a Dataset[Row], then converting an RDD to a DataFrame should be as simple as

val rddToDF = rdd.map(value => Row(value))

But instead the post shows that it takes this:

val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]

So clearly a DataFrame is actually a dataset of rows plus a schema.

Recommended answer

In Spark 2.0, the source code contains: type DataFrame = Dataset[Row]

So a DataFrame is a Dataset[Row] simply by definition.

A Dataset also has a schema; you can print it with the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself, but it's still there ;)

You can also call createTempView(name) on a Dataset and use it in SQL queries, just as you would with a DataFrame.
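To make the two points above concrete, here is a minimal sketch, assuming a local SparkSession. The object name, app name, and view name `letters` are made up for illustration. It shows that a typed Dataset[String] carries a schema and can be queried through a temp view.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // A typed Dataset[String]; its schema comes from the String encoder.
    val ds: Dataset[String] = Seq("a", "b", "c").toDS()
    ds.printSchema() // shows a single string column named "value"
    assert(ds.schema.fieldNames.sameElements(Array("value")))

    // A typed Dataset can be registered as a view and queried with SQL.
    ds.createTempView("letters")
    val count = spark.sql("SELECT COUNT(*) FROM letters").first().getLong(0)
    assert(count == 3L)

    spark.stop()
  }
}
```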

In other words, Dataset = the DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset with no user-specified encoder.
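The alias can be checked directly. The following is a small sketch rather than anything from the original answer: the `Person` case class, the object name, and the sample data are hypothetical. A DataFrame is assigned to a Dataset[Row] without any conversion, and then an encoder is supplied via `as[Person]`.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical case class used only for this illustration.
case class Person(name: String, age: Int)

object AliasSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-sketch")
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

    // Compiles with no conversion because DataFrame is a type alias for Dataset[Row].
    val rows: Dataset[Row] = df

    // Supplying an encoder (the implicit one for Person) gives a typed view of the same data.
    val people: Dataset[Person] = df.as[Person]

    assert(rows.schema.fieldNames.toSeq == Seq("name", "age"))
    assert(people.collect().toSet == Set(Person("Ann", 30), Person("Bob", 25)))

    spark.stop()
  }
}
```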

About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:

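// Brings toDF and the implicit Encoder[String] used below into scope
import sparkSession.implicits._

// Dataset[Row] = DataFrame, without a user-specified encoder.
// createDataFrame(rdd) without a schema only accepts RDDs of Products
// (case classes, tuples), so for an RDD[String] use toDF instead.
val rddToDF = rdd.toDF("value")
// Now it knows that the encoder for String should be used, so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]

// However, this can be shortened to:
val dataset = sparkSession.createDataset(rdd)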
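
As for why the blog post's version needs an explicit StructType: a Row is untyped, so an RDD[Row] carries no column names or types that Spark could use, whereas a typed RDD gets its schema from the encoder. The sketch below contrasts the two routes, assuming a local SparkSession and made-up sample data; the object name is hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RowSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    val rdd: RDD[String] = spark.sparkContext.parallelize(Seq("a", "b"))

    // Route 1: an RDD[Row] has no type information, so the schema is passed explicitly.
    val rowRdd: RDD[Row] = rdd.map(value => Row(value))
    val schema = StructType(Array(StructField("value", StringType)))
    val fromRows: DataFrame = spark.createDataFrame(rowRdd, schema)

    // Route 2: a typed RDD[String]; the implicit String encoder supplies the schema.
    val fromTyped: Dataset[String] = spark.createDataset(rdd)

    // Both end up with the same single "value" column and the same data.
    assert(fromRows.schema.fieldNames.sameElements(fromTyped.schema.fieldNames))
    assert(fromRows.as[String].collect().sorted.sameElements(Array("a", "b")))

    spark.stop()
  }
}
```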
