How to create a Spark Dataset from an RDD

Published: 2017/4/2 12:33:34 Tags: scala apache-spark dataset apache-spark-dataset


Question

I have an RDD[LabeledPoint] intended for use in a machine learning pipeline. How do I convert that RDD to a Dataset? Note that the newer spark.ml APIs require their input in Dataset format.

Recommended answer

Here is an answer that goes through an extra step, the DataFrame. We use the SQLContext to create a DataFrame and then convert it to a Dataset of the desired object type, in this case LabeledPoint:

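import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // supplies the Encoder that .as[LabeledPoint] needs
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]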

Update: Ever heard of SparkSession? (Neither had I until now.)

So apparently SparkSession is the Preferred Way (TM) from Spark 2.0.0 onward. Here is the updated code for the new (Spark) world order:

Spark 2.0.0+ approaches

Notice that both of the approaches below (the simpler of which is credited to @zero323) achieve an important saving compared to the SQLContext approach: it is no longer necessary to create a DataFrame first.

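import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // supplies the Encoder[LabeledPoint] that createDataset needs
val pointsTrainDs = sparkSession.createDataset(training)
// fit is the public Estimator entry point (train is not part of the public API)
val model = new LogisticRegression().fit(pointsTrainDs)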
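To make the createDataset route concrete, here is a minimal standalone sketch. It assumes Spark 2.x with spark-mllib on the classpath, a local master, and the spark.ml flavour of LabeledPoint (org.apache.spark.ml.feature.LabeledPoint); the object name, the `run` helper, and the four sample points are made up for illustration and merely stand in for the question's `training` RDD.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object FitOnDatasetSketch {
  // Hypothetical helper: builds a tiny RDD, converts it, and fits a model.
  def run(spark: SparkSession): LogisticRegressionModel = {
    import spark.implicits._ // Encoder[LabeledPoint] for createDataset

    // Made-up sample data standing in for the question's `training` RDD.
    val training = spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))
    ))

    // RDD[LabeledPoint] -> Dataset[LabeledPoint], no intermediate DataFrame.
    val pointsTrainDs = spark.createDataset(training)

    // LogisticRegression reads the default "label" and "features" columns,
    // which match LabeledPoint's field names.
    new LogisticRegression().setMaxIter(10).fit(pointsTrainDs)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fit-on-dataset-sketch")
      .getOrCreate()
    val model = run(spark)
    println(s"Coefficients: ${model.coefficients}")
    spark.stop()
  }
}
```

If the RDD holds the older org.apache.spark.mllib.regression.LabeledPoint instead, the conversion itself should still work, but spark.ml estimators expect the newer vector type, so the points would likely need converting first.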

Second way for Spark 2.0.0+ (credit to @zero323)

val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val trainDs = training.toDS()
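The two Spark 2.0.0+ routes can be compared side by side in a short sketch. It assumes Spark 2.x, a local master, and the spark.ml LabeledPoint; the object name, the `sampleTraining` helper, and the three sample points are hypothetical and only stand in for the question's `training` RDD.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDatasetSketch {
  // Hypothetical helper: a tiny RDD standing in for the question's `training`.
  def sampleTraining(spark: SparkSession): RDD[LabeledPoint] =
    spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.5, 1.3))
    ))

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-dataset-sketch")
      .getOrCreate()
    // Brings in both the toDS() syntax on RDDs and the Encoder[LabeledPoint].
    import spark.implicits._

    val training = sampleTraining(spark)

    // Route 1: the implicit toDS() extension method on the RDD.
    val viaToDS: Dataset[LabeledPoint] = training.toDS()
    // Route 2: the explicit factory method on the session.
    val viaCreate: Dataset[LabeledPoint] = spark.createDataset(training)

    viaToDS.printSchema()
    assert(viaToDS.count() == viaCreate.count())
    spark.stop()
  }
}
```

Both routes rely on the same implicit encoder, so the choice between them is mostly a matter of style.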

Traditional approach for Spark 1.x and earlier

val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()

See also: How to store custom objects in a Dataset in Spark 1.6, by the esteemed @zero323.
