How to convert a map to Spark's RDD
Question
I have a data set which is in the form of some nested maps, and its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1) and a nested map, which is the sparse representation of the non-zero elements associated with the sample.
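To make the shape concrete, here is a minimal sketch of such a data set in plain Scala. It assumes LabelType is simply Int, since the question only says the label is -1 or 1; the sample ids and feature values are invented for the example.

```scala
// Illustrative only: LabelType is assumed to be Int, and the sample
// ids and feature values are made up.
type LabelType = Int

val dataMap: Map[String, (LabelType, Map[Int, Double])] = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)), // label -1; features 0 and 3 are non-zero
  "x002" -> (1,  Map(2 -> 5.0, 3 -> 6.0))  // label  1; features 2 and 3 are non-zero
)
```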
I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
writeMapToLibSVMFile(data_map, "libsvm_data.txt") // Implemented somewhere else
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
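The question says writeMapToLibSVMFile is implemented elsewhere. For completeness, here is a hedged sketch of what such a function might do; the name and signature come from the question, but the body is an assumption. LibSVM lines have the form "label i1:v1 i2:v2 ..." with one-based, ascending feature indices, which is what MLUtils.loadLibSVMFile expects by default, so the map's zero-based keys are shifted by one.

```scala
import java.io.PrintWriter

// A sketch only: the real writeMapToLibSVMFile is not shown in the question.
// Writes one "label i1:v1 i2:v2 ..." line per sample, with one-based,
// ascending feature indices (the LibSVM convention).
def writeMapToLibSVMFile(data: Map[String, (Int, Map[Int, Double])],
                         path: String): Unit = {
  val out = new PrintWriter(path)
  try {
    for ((_, (label, values)) <- data) {
      val features = values.toSeq
        .sortBy(_._1)                           // LibSVM requires ascending indices
        .map { case (i, v) => s"${i + 1}:$v" }  // shift zero-based keys to one-based
        .mkString(" ")
      out.println(s"$label $features")
    }
  } finally out.close()
}
```

Note that the sample ids are dropped in this encoding, which is one reason loading directly from the map is attractive.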
I know it should be as easy to directly load the data variable from data_map, but I don't know how.
Thanks for your help!
Answer
I guess you want something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// If you know this upfront, otherwise it can be computed
// using flatMap
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???
val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

val trainRdd: RDD[(String, LabeledPoint)] = sc
  // Convert Map to Seq so it can be passed to parallelize
  .parallelize(trainMap.toSeq)
  .map { case (id, (labelInt, values)) =>
    // Convert nested map to Seq so it can be passed to Vectors.sparse
    val features = Vectors.sparse(nFeatures, values.toSeq)
    // Convert label to Double so it can be used in a LabeledPoint
    val label = labelInt.toDouble
    (id, LabeledPoint(label, features))
  }
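The comment at the top of the answer suggests computing nFeatures with flatMap when it is not known upfront. That expression is plain Scala over the map, so it can be checked without a SparkContext:

```scala
// Same toy map as in the answer.
val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

// Highest zero-based feature index seen in any sample, plus one,
// gives the size of the sparse feature vectors.
val nFeatures: Int = trainMap.values.flatMap(_._2.keys).max + 1
```

With the indices above (0, 2, and 3), this yields 4.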