Spark 2.0 - Convert DataFrame to DataSet
Problem description
I want to load my data and do some basic linear regression on it. So first, I need to use VectorAssembler to produce my features column. However, when I use assembler.transform(df), df is a DataFrame, and it expects a DataSet. I tried df.toDS, but it gives "value toDS is not a member of org.apache.spark.sql.DataFrame". Indeed, it is a member of org.apache.spark.sql.DatasetHolder.
What am I doing wrong here?
package main.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DatasetHolder
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

object Analyzer {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", "\t")
      .option("parserLib", "UNIVOCITY")
      .option("inferSchema", "true")
      .load("data/snap/*")

    val assembler = new VectorAssembler()
      .setInputCols(Array("own", "want", "wish", "trade", "comment"))
      .setOutputCol("features")

    val df1 = assembler.transform(df)

    val formula = new RFormula()
      .setFormula("rank ~ own + want + wish + trade + comment")
      .setFeaturesCol("features")
      .setLabelCol("rank")
  }
}
Recommended answer
Apparently the problem was that I was still using the Spark 1.6 style SparkContext and SQLContext. I switched to SparkSession, and transform() was then able to accept the DataFrame implicitly.
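This works because in Spark 2.0, DataFrame is defined as a type alias for Dataset[Row], so any method that expects a Dataset accepts a DataFrame directly. A minimal illustration of the alias (a sketch, assuming a running SparkSession named spark):

```scala
// In Spark 2.0+, org.apache.spark.sql defines:
//   type DataFrame = Dataset[Row]
// so a DataFrame can be passed wherever a Dataset[Row] is expected.
import org.apache.spark.sql.{DataFrame, Dataset, Row}

def takesDataset(ds: Dataset[Row]): Long = ds.count()

val df: DataFrame = spark.range(5).toDF("n")
takesDataset(df) // compiles: DataFrame is a Dataset[Row]
```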
package main.scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

object Analyzer {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val df = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", "\t")
      .option("parserLib", "UNIVOCITY")
      .option("inferSchema", "true")
      .load("data/snap/*")

    df.show()

    val assembler = new VectorAssembler()
      .setInputCols(Array("own", "want", "wish", "trade", "comment"))
      .setOutputCol("features")

    val df1 = assembler.transform(df)
  }
}
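With the assembled features column in df1, the linear regression the question set out to do can then be fit. A minimal sketch, assuming df1 has the "features" vector column produced by the VectorAssembler and a numeric "rank" column to use as the label:

```scala
// Sketch: fit the linear regression the question intended.
// Assumes df1 has a "features" vector column and a numeric "rank" label column.
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("rank")

val model = lr.fit(df1)
println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")
```

Note that in Spark 2.0 CSV support is built in, so spark.read.csv(...) can be used instead of the external com.databricks.spark.csv package, though the package still works.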