How to create a Spark Dataset from an RDD
Question
I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.
Answer
Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type - in this case a LabeledPoint:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // provides the Encoder needed by .as[LabeledPoint]
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
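The training value in these snippets is assumed to already exist as an RDD[LabeledPoint]. A minimal sketch of constructing one (using the mllib LabeledPoint; the ml.feature variant available in 2.0+ works the same way):

```scala
// Hypothetical setup for the snippets in this answer: a tiny RDD[LabeledPoint]
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))
```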
Update Ever heard of a SparkSession? (neither had I until now..)

So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:
Spark 2.0.0+ approach

Notice that in both of the approaches below (the simpler of which is credit to @zero323) we have achieved an important savings as compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides the Encoder needed by createDataset
// Note: for spark.ml, LabeledPoint should be org.apache.spark.ml.feature.LabeledPoint
val pointsTrainDs = sparkSession.createDataset(training)
val model = new LogisticRegression().fit(pointsTrainDs)
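Once the Dataset exists it can be fed straight to the spark.ml estimator. As a sketch of the follow-up step (assuming the fitted model above), transform() scores a Dataset and appends prediction columns:

```scala
// Sketch: score the training Dataset with the fitted model.
// transform() returns a DataFrame with added prediction columns.
val predictions = model.transform(pointsTrainDs)
predictions.select("label", "prediction").show(5)
```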
Second way for Spark 2.0.0+ Credit to @zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val trainDs = training.toDS()
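For completeness, the conversion is reversible: every Dataset exposes its underlying RDD, so round-tripping is a one-liner (a sketch, assuming the trainDs built above):

```scala
// Going back the other way: Dataset -> RDD
val backToRdd: org.apache.spark.rdd.RDD[LabeledPoint] = trainDs.rdd
```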
Legacy approach for Spark 1.x and earlier
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache() // splits as produced by e.g. randomSplit
val test = splits(1)
val trainDs = training.toDS()
See also: How to store custom objects in a Dataset in Spark 1.6 by the esteemed @zero323.