How to create a Spark Dataset from an RDD
Question
I have an `RDD[LabeledPoint]` intended to be used within a machine learning pipeline. How do we convert that `RDD` to a `Dataset`? Note that the newer `spark.ml` APIs require inputs in the `Dataset` format.
Answer
Here is an answer that traverses an extra step - the `DataFrame`. We use the `SQLContext` to create a `DataFrame` and then create a `Dataset` using the desired object type - in this case a `LabeledPoint`:
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.regression.LabeledPoint

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // supplies the implicit Encoder needed by .as[LabeledPoint]

val pointsTrainDf = sqlContext.createDataFrame(training)   // RDD[LabeledPoint] -> DataFrame
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]         // DataFrame -> Dataset[LabeledPoint]
Update: Ever heard of a `SparkSession`? (Neither had I until now...)
So apparently `SparkSession` is the Preferred Way (TM) in Spark 2.0.0 and going forward. Here is the updated code for the new (Spark) world order:
Spark 2.0.0+ approach
Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the `SQLContext` approach: it is no longer necessary to first create a `DataFrame`.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._   // supplies the implicit Encoder for LabeledPoint

val pointsTrainDs = sparkSession.createDataset(training)   // RDD[LabeledPoint] -> Dataset[LabeledPoint]
val model = new LogisticRegression().fit(pointsTrainDs)    // spark.ml estimators expose fit(), not train()
A second way for Spark 2.0.0+, credit to @zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._      // brings toDS() into scope for RDDs of encodable types
val trainDs = training.toDS()
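Putting the pieces together, a minimal self-contained sketch might look like the following (the master URL, app name, and toy data here are my own illustrative assumptions, not part of the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder().master("local[*]").appName("rdd-to-ds").getOrCreate()
import spark.implicits._

// An RDD[LabeledPoint] such as the question describes
val points = spark.sparkContext.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))
))

val trainingDs = points.toDS()                        // RDD -> Dataset[LabeledPoint]
val model = new LogisticRegression().fit(trainingDs)  // spark.ml accepts the Dataset directly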
Traditional Spark 1.X and earlier approach
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()
See also: How to store custom objects in Dataset? by the esteemed @zero323 .
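As that linked answer explains, conversions like `toDS()` rely on an implicit `Encoder` being in scope: ordinary case classes get one for free from the implicits import, while arbitrary classes need an explicit (e.g. Kryo) encoder. A rough sketch with hypothetical class names (not from the original answer):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// A case class gets an Encoder automatically from spark.implicits._
case class Point(label: Double, x: Double)   // hypothetical example class
val pointDs = spark.createDataset(Seq(Point(0.0, 1.0), Point(1.0, 2.0)))

// A non-case class needs an explicit encoder, e.g. Kryo serialization
class Legacy(val v: Int)                     // hypothetical non-case class
implicit val legacyEnc: Encoder[Legacy] = Encoders.kryo[Legacy]
val legacyDs = spark.createDataset(Seq(new Legacy(1)))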