Load classification test data into sparse vector in Apache Spark

Question

I have a classification model in Spark MLlib which was built using training data. Now I would like to use it to predict unlabeled data.

I have my features (without the labels) in LIBSVM format. This is a sample of what my unlabeled data looks like:

1:1  18:1
4:1  32:1
2:1  8:1  33:1
1:1  6:1  11:1
1:1  2:1  8:1  28:1

I have these features saved in a text file on HDFS. How can I load them into an RDD[Vector] so I can pass them to model.predict()?

I use Scala for coding.

Thanks.

Recommended answer

Here is a solution, assuming the indices are one-based and in ascending order:

import org.apache.spark.mllib.linalg.Vectors

// Create dummy data similar to the lines in your text file.
val data = sc.parallelize(Seq("1:1  18:1", "4:1  32:1", "2:1  8:1  33:1", "1:1  6:1  11:1", "1:1  2:1  8:1  28:1"))

// Transform the data into a pair RDD with indices and values.
val parsed = data.map(_.trim).map { line =>
  val items = line.split(' ')
  val (indices, values) = items.filter(_.nonEmpty).map { item =>
    val indexAndValue = item.split(':')
    val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
    val value = indexAndValue(1).toDouble
    (index, value)
  }.unzip

  (indices.toArray, values.toArray)
}

// Number of features: the largest zero-based index in the data, plus one.
// This relies on each line's indices being in ascending order.
val numFeatures = parsed.map { case (indices, _) => indices.lastOption.getOrElse(0) }.reduce(math.max) + 1
// Create the sparse vectors.
val vectors = parsed.map { case (indices, values) => Vectors.sparse(numFeatures, indices, values) }

vectors.take(10).foreach(println)

// (33,[0,17],[1.0,1.0])
// (33,[3,31],[1.0,1.0])
// (33,[1,7,32],[1.0,1.0,1.0])
// (33,[0,5,10],[1.0,1.0,1.0])
// (33,[0,1,7,27],[1.0,1.0,1.0,1.0])
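
The per-line parsing step above can also be pulled out into a small standalone function, which makes it easier to check without a Spark cluster. The following is only a sketch: the object name `LibSvmFeatures` and the function `parseLine` are hypothetical, not part of the original answer or of MLlib. It splits on any whitespace and, as an added assumption, rejects lines whose indices are not strictly increasing, since the answer relies on ascending order.

```scala
// Hypothetical helper (not from the original answer or from MLlib):
// parses one label-free LIBSVM line such as "1:1  18:1" into
// zero-based indices and their values.
object LibSvmFeatures {
  def parseLine(line: String): (Array[Int], Array[Double]) = {
    val pairs = line.trim
      .split("\\s+")       // tolerate tabs and repeated spaces
      .filter(_.nonEmpty)  // an empty line yields no features
      .map { item =>
        val Array(index, value) = item.split(':')
        (index.toInt - 1, value.toDouble) // convert one-based to zero-based
      }
    // Assumption carried over from the answer: indices are in ascending order.
    val indices = pairs.map(_._1)
    require(
      indices.sliding(2).forall(w => w.length < 2 || w(0) < w(1)),
      s"indices must be strictly increasing: $line")
    pairs.unzip
  }
}
```

With Spark, something like `sc.textFile(path).map(LibSvmFeatures.parseLine)` could replace the dummy `data` RDD, where `path` stands for the location of your file on HDFS. Note that `numFeatures` above is derived from the unlabeled data itself; if the model was trained on more features than appear in this file, the vector size should most likely be the training dimension instead.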
