Spark: How can DataFrame be Dataset[Row] if DataFrames have a schema?
This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as:
val rddToDF = rdd.map(value => Row(value))
But instead the blog post shows this:
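import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import sparkSession.implicits._ // needed for .as[String]
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value", StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD, dfschema)
val rDDToDataSet = rddToDF.as[String]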
Clearly, a DataFrame is actually a Dataset of rows plus a schema.
In Spark 2.0, the source code (the org.apache.spark.sql package object) contains this type alias:
type DataFrame = Dataset[Row]
So a DataFrame is a Dataset[Row] simply by definition.
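As a minimal sketch of what the alias means in practice, the snippet below passes a DataFrame where a Dataset[Row] is expected without any conversion. The object name, the local master setting, and the app name are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object AliasDemo {
  // Compiles only because DataFrame and Dataset[Row] are the same type.
  def asDatasetOfRows(df: DataFrame): Dataset[Row] = df

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and appName are assumptions.
    val spark = SparkSession.builder().master("local[1]").appName("alias-demo").getOrCreate()
    val df: DataFrame = spark.range(3).toDF("id")
    val rows: Dataset[Row] = asDatasetOfRows(df) // no conversion happens
    println(rows eq df) // prints true: it is the very same object
    spark.stop()
  }
}
```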
A Dataset also has a schema; you can print it with the printSchema() function. Spark normally infers the schema, so you don't have to write it yourself, but it's still there ;)
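To illustrate, here is a small sketch in which a typed Dataset[String] is created without writing any schema, and the schema Spark derived from the encoder is then inspected. The object and method names and the sample values are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object SchemaDemo {
  def stringDataset(spark: SparkSession): Dataset[String] = {
    import spark.implicits._ // brings Encoder[String] into scope
    spark.createDataset(Seq("a", "b"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and appName are assumptions.
    val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()
    val ds = stringDataset(spark)
    ds.printSchema() // shows a single string column named "value"
    println(ds.schema.fieldNames.mkString(","))
    spark.stop()
  }
}
```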
You can also call createTempView(name) on a Dataset and use it in SQL queries, just as with DataFrames.
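A rough sketch of that, assuming a Dataset[String] and a hypothetical view name "words". It uses createOrReplaceTempView rather than createTempView so the example can be re-run in the same session without failing on an existing view.

```scala
import org.apache.spark.sql.SparkSession

object TempViewDemo {
  def countSpark(spark: SparkSession): Long = {
    import spark.implicits._ // brings Encoder[String] into scope
    val ds = spark.createDataset(Seq("spark", "scala", "spark"))
    ds.createOrReplaceTempView("words") // "words" is a hypothetical view name
    // The single column of a Dataset[String] is called "value".
    spark.sql("SELECT COUNT(*) AS n FROM words WHERE value = 'spark'").first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and appName are assumptions.
    val spark = SparkSession.builder().master("local[1]").appName("view-demo").getOrCreate()
    println(countSpark(spark))
    spark.stop()
  }
}
```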
In other words, Dataset = DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset without an encoder for a specific class of yours.
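The role of the encoder can be sketched as follows: the same data is first held as an untyped DataFrame and then decoded into a case class with as[...]. The Person class, the column names, and the sample rows are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class used only for this illustration.
case class Person(name: String, age: Int)

object EncoderDemo {
  def typed(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // derives Encoder[Person] and enables toDF
    val df: DataFrame = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")
    df.as[Person] // same data, now decoded into Person objects
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and appName are assumptions.
    val spark = SparkSession.builder().master("local[1]").appName("encoder-demo").getOrCreate()
    typed(spark).collect().foreach(p => println(s"${p.name} is ${p.age}"))
    spark.stop()
  }
}
```

The case class is declared at the top level because Spark cannot derive an encoder for a class defined inside a method.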
About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:
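import sparkSession.implicits._ // provides Encoder[String] and rdd.toDF()
// Dataset[Row] = DataFrame, without a type-specific encoder.
// Assumes rdd is an RDD[String]; createDataFrame(rdd) only compiles for RDDs of
// Product types (tuples, case classes), so toDF() is used here instead.
val rddToDF = rdd.toDF()
// Now Spark is told to use the encoder for String, so it becomes a Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// However, this can be shortened to:
val dataset = sparkSession.createDataset(rdd)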