Load classification test data into sparse vector in Apache Spark
Question
I have a classification model in Spark MLlib which was built using training data. Now I would like to use it to predict unlabeled data.
I have my features (without the labels) in LIBSVM format. This is a sample of what my unlabeled data looks like:
1:1 18:1
4:1 32:1
2:1 8:1 33:1
1:1 6:1 11:1
1:1 2:1 8:1 28:1
I have these features saved in a text file on HDFS. How can I load them into an RDD[Vector] so I can pass them to model.predict()?
I use Scala for coding.
Thanks.
Answer
Here is a solution, assuming the indices are one-based and in ascending order:
import org.apache.spark.mllib.linalg.Vectors

// Create dummy data similar to the one in your text file.
// (For the real file on HDFS, read it with sc.textFile("hdfs://...") instead.)
val data = sc.parallelize(Seq("1:1 18:1", "4:1 32:1", "2:1 8:1 33:1", "1:1 6:1 11:1", "1:1 2:1 8:1 28:1"))
// Transform the data into a pair RDD with indices and values.
val parsed = data.map(_.trim).map { line =>
val items = line.split(' ')
val (indices, values) = items.filter(_.nonEmpty).map { item =>
val indexAndValue = item.split(':')
val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
val value = indexAndValue(1).toDouble
(index, value)
}.unzip
(indices.toArray, values.toArray)
}
// Determine the number of features (indices are ascending, so the last index per line is its max).
val numFeatures = parsed.map { case (indices, _) => indices.lastOption.getOrElse(0) }.reduce(math.max) + 1
// Create Vectors
val vectors = parsed.map { case (indices, values) => Vectors.sparse(numFeatures, indices, values) }
vectors.take(10) foreach println
// (33,[0,17],[1.0,1.0])
// (33,[3,31],[1.0,1.0])
// (33,[1,7,32],[1.0,1.0,1.0])
// (33,[0,5,10],[1.0,1.0,1.0])
// (33,[0,1,7,27],[1.0,1.0,1.0,1.0])
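If you want to sanity-check the per-line parsing without a Spark session, the same transformation can be sketched in plain Scala. The `parseLine` helper and the `ParseCheck` object are my own names for illustration, not part of any Spark API:

```scala
// Plain-Scala sketch of the per-line LIBSVM parsing used above (no Spark needed).
object ParseCheck {
  def parseLine(line: String): (Array[Int], Array[Double]) = {
    val items = line.trim.split(' ').filter(_.nonEmpty)
    val (indices, values) = items.map { item =>
      val Array(i, v) = item.split(':')
      (i.toInt - 1, v.toDouble) // convert 1-based LIBSVM indices to 0-based
    }.unzip
    (indices.toArray, values.toArray)
  }

  def main(args: Array[String]): Unit = {
    val (idx, vals) = parseLine("1:1 2:1 8:1 28:1")
    println(idx.mkString(","))  // 0,1,7,27
    println(vals.mkString(",")) // 1.0,1.0,1.0,1.0
  }
}
```

This mirrors the map inside the RDD pipeline, so you can verify the index shift and value parsing on a single line before running the full job.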