How to create a Spark Dataset from an RDD

Views: 21 Published: 2022/1/21 13:05:46 Tags: scala apache-spark dataset apache-spark-dataset


Question

I have an RDD[LabeledPoint] that I intend to use in a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require their input in Dataset format.

Recommended answer

Here is an answer that goes through an extra step: the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset with the desired object type, in this case LabeledPoint:
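import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // supplies the Encoder that .as[LabeledPoint] needs
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]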

Update: Ever heard of SparkSession? (Neither had I until now.)
So apparently SparkSession is the Preferred Way (TM) from Spark 2.0.0 onward. Here is the updated code for the new (Spark) world order:
Spark 2.0.0+ approaches
Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to create a DataFrame first.
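import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._  // supplies the Encoder that createDataset needs
val pointsTrainDs = sparkSession.createDataset(training)
// fit is the public entry point of a spark.ml estimator (train is protected)
val model = new LogisticRegression().fit(pointsTrainDs)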
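To make the createDataset route concrete, here is a minimal end-to-end sketch. It rests on several assumptions that are not in the original answer: a local SparkSession, a small hand-written RDD standing in for `training`, and the `org.apache.spark.ml.feature.LabeledPoint` type (the spark.ml estimators expect `ml.linalg` vectors, so an RDD of the older mllib LabeledPoint would probably need converting first).

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session, used here purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-to-dataset")
  .getOrCreate()
import spark.implicits._ // brings an Encoder[LabeledPoint] into scope

// Hypothetical toy data standing in for the question's `training` RDD.
val training = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))
))

// RDD[LabeledPoint] -> Dataset[LabeledPoint], with no DataFrame in between.
val pointsTrainDs: Dataset[LabeledPoint] = spark.createDataset(training)

// The resulting columns are named "label" and "features",
// which are the default input columns of LogisticRegression.
val model = new LogisticRegression().setMaxIter(10).fit(pointsTrainDs)
```

The Dataset's column names come from the case class fields, which is why no extra column configuration is needed before calling fit.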

Second way for Spark 2.0.0+, credit to @zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val trainDs = training.toDS()
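As a rough illustration of what toDS() gives you, the sketch below converts a small RDD and then works with the result as typed objects. The local session, the sample data, and the use of `org.apache.spark.ml.feature.LabeledPoint` are assumptions made for the example, not part of the original answer.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session, used here purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("toDS-sketch")
  .getOrCreate()
import spark.implicits._ // adds toDS()/toDF() to RDDs and provides encoders

// Hypothetical toy data standing in for the question's `training` RDD.
val training = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.5)),
  LabeledPoint(0.0, Vectors.dense(2.0, -1.0))
))

val trainDs: Dataset[LabeledPoint] = training.toDS()

// Typed operations work on LabeledPoint objects rather than on Rows.
val positives = trainDs.filter(_.label == 1.0)

// The DataFrame detour from the first approach yields the same typed Dataset.
val viaDataFrame: Dataset[LabeledPoint] = training.toDF().as[LabeledPoint]
```

Both routes end in a Dataset[LabeledPoint]; toDS() just skips the intermediate untyped DataFrame.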

Traditional approach for Spark 1.X and earlier
val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()

See also: How to store custom objects in Dataset? by the esteemed @zero323.
