How to encode categorical features in Apache Spark
Problem description
I have a set of data based on which I want to create a classification model. Each row has the following form:
user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib) BUT in order to apply that function I have to create the dense vectors (with the 0s), first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
Given that I am a newbie with regard to MLlib, could you provide a concrete example? I am using MLlib 1.2.
EDIT
I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?
val data = test11
  .map(x => ((x(0), x(1)), x(2)))        // key by (user, class), value = product
  .groupByKey()
  .map(x => (x._1, x._2.toArray))
  .map { x =>
    // One dense slot per known product (test12 is the full product list)
    val lt: Array[Double] = new Array[Double](test12.size)
    val id = x._1._1
    val cl = x._1._2
    val dt = x._2
    var i = -1
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    val vs = Vectors.dense(lt)
    (id, cl, vs)
  }
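If only MLlib 1.2 is available, one way to avoid materializing a dense array for every user is to build the sparse vector directly from an index map over the products. This is a sketch, not a tested solution: it assumes `test11` and `test12` mean the same as in the code above, and that the full product list fits in a broadcast variable.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Map each product to a fixed column index, and broadcast the map so
// every task can do O(1) lookups instead of scanning the product list.
val productIndex = sc.broadcast(test12.zipWithIndex.toMap)
val numProducts = test12.size

val sparseData = test11
  .map(x => ((x(0), x(1)), x(2)))        // ((user, class), product)
  .groupByKey()
  .map { case ((id, cl), products) =>
    // Sorted, distinct column indices of the products this user has;
    // Vectors.sparse expects indices in increasing order.
    val indices = products.map(productIndex.value).toArray.distinct.sorted
    val values  = Array.fill(indices.length)(1.0)
    (id, cl, Vectors.sparse(numProducts, indices, values))
  }
```

Compared with the dense version, each record stores only its nonzero entries, and the per-record work is proportional to the number of products a user has rather than to the full product catalogue.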
Recommended answer
You can use spark.ml's OneHotEncoder.
You first use:

OneHotEncoder.categories(rdd, categoricalFields)

where categoricalFields is the sequence of indexes at which your RDD contains categorical data. categories, given a dataset and the indexes of the columns which are categorical variables, returns a structure that, for each field, describes the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

which returns your vectorized RDD[Array[T]].
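Put together, a usage sketch based on the two calls described above. Note that everything beyond those two calls is an assumption: the variable names and the column indexes (1 for class, 2 for product, following the sample rows in the question) are hypothetical, and you should check the exact import path and signatures against the OneHotEncoder you have available.

```scala
// Sketch only: relies on the OneHotEncoder.categories/encode methods as
// described above; column indexes and names are assumptions.
val rdd = test11                     // RDD[Array[String]]: (user, class, product)
val categoricalFields = Seq(1, 2)    // class and product columns are categorical

// Describe which values occur in each categorical column...
val categories = OneHotEncoder.categories(rdd, categoricalFields)

// ...then use that description to one-hot encode the RDD.
val encoded = OneHotEncoder.encode(rdd, categories)  // vectorized RDD[Array[T]]
```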