How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?


Question

I'm struggling to understand how conversion among RDDs, DataSets and DataFrames works. I'm pretty new to Spark, and I get stuck every time I need to pass from one data model to another (especially from RDDs to Datasets and Dataframes). Could anyone explain to me the right way to do it?

As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to a machine learning algorithm, for example KMeans (Spark DataSet MLlib). So I need to convert it to a Dataset with a single column named "features", which should contain rows of type Vector. How should I do this?

Answer

To convert an RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import the implicits, which is done using the SparkSession object. It can be done as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // brings toDF() into scope

val df = rdd.toDF("features")

toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they are converted implicitly, i.e. there is no need to do anything; when the RDD has multiple columns it already contains a tuple, so nothing is needed either. However, in this special case you first need to convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[(org.apache.spark.ml.linalg.Vector)]. Therefore, it is necessary to do a conversion to a tuple as follows:

val df = rdd.map(Tuple1(_)).toDF("features")

The above will convert the RDD to a dataframe with a single column called features.
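Putting the pieces together, a minimal runnable sketch (assuming the spark-mllib dependency is on the classpath and a local master, both of which are illustrative choices here, not part of the original answer):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// An example RDD[Vector], standing in for whatever your pipeline produces
val rdd = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
))

// Wrap each Vector in a Tuple1 so toDF() can derive the one-column schema
val df = rdd.map(Tuple1(_)).toDF("features")
df.printSchema()  // features: vector (nullable = true)
```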

To convert to a dataset the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:

case class A(features: org.apache.spark.ml.linalg.Vector)

val ds = df.as[A]
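For the KMeans use case from the question, the resulting ds (or the dataframe directly, since spark.ml estimators accept any Dataset) can then be passed to fit. A hedged sketch, assuming the ds defined above and illustrative values for k and seed:

```scala
import org.apache.spark.ml.clustering.KMeans

// KMeans reads the column named by setFeaturesCol ("features" is the default)
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(ds)
val predictions = model.transform(ds)  // adds a "prediction" column
```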

To complete the set of conversions, the underlying RDD of a dataframe or dataset can be accessed using .rdd:

val rdd = df.rdd


Instead of converting back and forth between RDDs and dataframes/datasets, it is usually easier to do all computations using the DataFrame API. If there is no suitable function for the operation you want, it is usually possible to define a UDF (user-defined function). See examples here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
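As an illustration of the UDF approach, a minimal sketch operating on the "features" column (the L2-norm computation is a hypothetical example, not something from the original answer):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// A UDF computing the L2 norm of each feature vector (illustrative example)
val l2Norm = udf { v: Vector => math.sqrt(v.toArray.map(x => x * x).sum) }

// Adds a "norm" column without ever leaving the DataFrame API
val withNorm = df.withColumn("norm", l2Norm(col("features")))
```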
