How to convert a map to Spark's RDD


Question

I have a data set which is in the form of some nested maps, and its Scala type is:

Map[String, (LabelType,Map[Int, Double])]

The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.

I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.

It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:

writeMapToLibSVMFile(data_map, "libsvm_data.txt") // Implemented somewhere else
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
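For completeness, the LibSVM-writing helper could look like the following. This is only a sketch under the assumptions stated above (labels are `Int`, the inner map is `Map[Int, Double]` with 0-based indices); the actual `writeMapToLibSVMFile` is implemented elsewhere and may differ.

```scala
import java.io.PrintWriter

// Hypothetical sketch of writeMapToLibSVMFile for the
// Map[String, (Int, Map[Int, Double])] layout described above.
// LibSVM lines look like "<label> <index>:<value> ...", with
// 1-based indices in ascending order.
def writeMapToLibSVMFile(
    data: Map[String, (Int, Map[Int, Double])],
    path: String): Unit = {
  val writer = new PrintWriter(path)
  try {
    for ((_, (label, features)) <- data) {
      val fields = features.toSeq
        .sortBy(_._1) // LibSVM expects ascending indices
        .map { case (i, v) => s"${i + 1}:$v" } // shift to 1-based
      writer.println((label.toString +: fields).mkString(" "))
    }
  } finally writer.close()
}
```

Note the index shift: `MLUtils.loadLibSVMFile` expects 1-based feature indices, while the in-memory map uses 0-based ones.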

I know it should be as easy to directly load the data variable from data_map, but I don't know how.

Any help is appreciated!

Answer

I guess you want something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// If you know this upfront, otherwise it can be computed
// using flatMap
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ??? 

val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

val trainRdd: RDD[(String, LabeledPoint)] = sc
  // Convert the Map to a Seq so it can be passed to parallelize
  .parallelize(trainMap.toSeq)
  .map { case (id, (labelInt, values)) =>

    // Convert the nested map to a Seq so it can be passed to Vectors.sparse
    val features = Vectors.sparse(nFeatures, values.toSeq)

    // Convert the label to Double so it can be used for LabeledPoint
    val label = labelInt.toDouble

    (id, LabeledPoint(label, features))
  }
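As the comment on `nFeatures` suggests, the dimension can be computed from the map itself rather than known upfront; a minimal sketch using the same `trainMap`:

```scala
val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

// The sparse vectors use 0-based indices, so the vector size
// is the largest feature index plus one.
val nFeatures = trainMap.values.flatMap(_._2.keys).max + 1
// nFeatures == 4
```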

