如何将文本和数字数据的混合转换为Apache Spark中的特征数据 [英] how to convert mix of text and numerical data to feature data in apache spark

查看:272
本文介绍了如何将文本和数字数据的混合转换为Apache Spark中的特征数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个文本和数字数据的CSV.我需要将其转换为Spark中的特征向量数据(双精度值).有什么办法吗?

I have a CSV of both textual and numerical data. I need to convert it to feature vector data in Spark (Double values). Is there any way to do that ?

例如,我看到每个关键字都映射到某个double值并使用该值进行转换的情况.但是,如果有多个关键字,则很难做到这一点.

I see some e.g where each keyword is mapped to some double value and use this to convert. However if there are multiple keywords, it is difficult to do this way.

还有其他出路吗?我看到Spark提供了提取器,它将转换为特征向量.有人可以举个例子吗?

Is there any other way out? I see Spark provides Extractors which will convert into feature vectors. Could someone please give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K

推荐答案

最后,我这样做了.我遍历每个数据,并使用键作为每个项创建一个映射,并增加一个Double计数器.

Finally I did this way. I iterate through each data and make a map with key as each item and increment a Double counter.

def createMap(data: RDD[String]) : Map[String,Double] = {  
 var mapData:Map[String,Double] = Map()
 var counter = 0.0
 data.collect().foreach{ item => 
  counter = counter +1
  mapData += (item -> counter)
 }
 mapData
}

def getLablelValue(input: String): Int = input match {
 case "<=50K" => 0
 case ">50K" => 1
}


val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd  = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)


val featureVector = census.map{line => 
  val fields = line.split(", ")
 LabeledPoint(getLablelValue(fields(14).toString) , Vectors.dense(fields(0).toDouble,  orgTypeMap(fields(1).toString) , fields(2).toDouble , gradeTypeMap(fields(3).toString) , fields(4).toDouble , marStatusMap(fields(5).toString), jobTypeMap(fields(6).toString), familyStatusMap(fields(7).toString),raceTypeMap(fields(8).toString),genderTypeMap (fields(9).toString), fields(10).toDouble , fields(11).toDouble , fields(12).toDouble,countryMap(fields(13).toString) , salaryRangeMap(fields(14).toString)))
}

这篇关于如何将文本和数字数据的混合转换为Apache Spark中的特征数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆