From DataFrame to RDD[LabeledPoint]


Question

I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
import org.apache.spark.mllib.regression.LabeledPoint

val sql = new SQLContext(sc)

// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

// Convert the RDD to a dataframe
val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)

// Tokenize
val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// TF-IDF
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
tf.cache
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(tf)
val tfidf = idfModel.transform(tf)

// Create labeled points
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))

I need to use DataFrames to generate the tokens and create the TF-IDF features. The problem appears when I try to convert this DataFrame to an RDD[LabeledPoint]. I map over the DataFrame rows, but the get method of Row returns Any, not the type defined in the DataFrame schema (Vector). Therefore, I cannot construct the RDD I need to train an ML model.

What is the best option to get an RDD[LabeledPoint] after calculating TF-IDF?

Recommended answer

Casting the object worked for me.

Try:

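// MLlib's Vector must be imported; a bare `Vector` otherwise resolves to Scala's collection type
import org.apache.spark.mllib.linalg.Vector

// Create labeled points
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector]))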
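
A variation on the cast above, offered as a sketch rather than a definitive fix: it selects the two needed columns by name instead of by position and uses Row's getAs[T] for the feature vector. It assumes Spark 1.x, where the ml feature transformers emit org.apache.spark.mllib.linalg.Vector (on Spark 2.x and later the vector type differs and this would need adapting). The helper name toLabeledPoints is hypothetical. Because the question's schema declares "class" as StringType, the sketch parses the label with toDouble, which assumes the labels are numeric strings such as "0" and "1".

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper: builds an RDD[LabeledPoint] from a DataFrame that has
// a string label column and an MLlib Vector feature column (Spark 1.x assumed).
def toLabeledPoints(df: DataFrame,
                    labelCol: String = "class",
                    featuresCol: String = "features"): RDD[LabeledPoint] =
  df.select(labelCol, featuresCol).rdd.map { row =>
    // Assumption: labels are numeric strings; the column is StringType,
    // so it is read with getString and parsed instead of using getDouble.
    val label = row.getString(0).toDouble
    // getAs[T] casts the untyped cell to the requested type.
    val features = row.getAs[Vector](1)
    LabeledPoint(label, features)
  }
```

With the question's tfidf DataFrame, toLabeledPoints(tfidf) would yield the RDD to pass to an MLlib trainer. Selecting columns by name avoids depending on "features" being at index 4.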
