How to encode categorical features in Apache Spark
Problem description
I have a set of data based on which I want to create a classification model. Each row has the following form:
user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
There are about 1M users, 2 classes, and 1M products. What I would like to do next is create sparse vectors (something MLlib already supports), but in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
Given that I am new to MLlib, could you provide a concrete example? I am using MLlib 1.2.
EDIT
I have ended up with the following piece of code, but it turns out to be really slow. Any other ideas, given that I can only use MLlib 1.2?
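import org.apache.spark.mllib.linalg.Vectors

// Assumed from usage: test11 is an RDD of rows already split into (user, class, product) fields,
// and test12 is a local collection of all products.
val data = test11
  .map(x => ((x(0), x(1)), x(2)))   // key each product by (user, class)
  .groupByKey()
  .map { case ((id, cl), products) =>
    val dt = products.toArray
    // Dense 0/1 vector with one slot per product in test12
    val lt = new Array[Double](test12.size)
    var i = -1
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    (id, cl, Vectors.dense(lt))
  }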
You need to upgrade to MLlib >= 1.4.0, where the simplest of several options is to use MLlib's OneHotEncoder.
You first use:
OneHotEncoder.categories(rdd, categoricalFields)
where categoricalFields is the sequence of indexes at which your RDD contains categorical data. Given a dataset and the indexes of the columns that hold categorical variables, categories returns a structure that, for each field, describes the values present in the dataset. That map is meant to be used as input to the encode method:
OneHotEncoder.encode(rdd, categories)
which returns your vectorized RDD[Array[T]].
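If upgrading is not an option, one possible way around the slow dense construction in the question is to index the products once and store only the positions of the products each user actually has. The following is a minimal sketch in plain Scala collections, not Spark code; the names SparseEncodingSketch, productIndex and encode are made up for illustration. With MLlib 1.2, the resulting size and indices could then be passed to Vectors.sparse(size, indices, values), and on a real dataset the index map would presumably be built from the RDD and broadcast; that wiring is left out here.

```scala
object SparseEncodingSketch {
  // Rows in the same (user, class, product) form as the question's data.
  val rows: Seq[(String, String, String)] = Seq(
    ("user1", "class1", "product1"),
    ("user1", "class1", "product2"),
    ("user1", "class1", "product5"),
    ("user2", "class1", "product2"),
    ("user2", "class1", "product5"),
    ("user3", "class2", "product1")
  )

  // Assign every distinct product a stable column index (sorted for determinism).
  val productIndex: Map[String, Int] =
    rows.map(_._3).distinct.sorted.zipWithIndex.toMap

  // For each (user, class) keep only the sorted indices of the products that occur.
  // A map lookup per product replaces the scan over the full product list.
  def encode(data: Seq[(String, String, String)]): Map[(String, String), Array[Int]] =
    data
      .groupBy { case (user, cls, _) => (user, cls) }
      .map { case (key, group) =>
        key -> group.map(r => productIndex(r._3)).distinct.sorted.toArray
      }

  def main(args: Array[String]): Unit = {
    val size = productIndex.size
    encode(rows).toSeq.sortBy(_._1).foreach { case ((user, cls), indices) =>
      // With MLlib, this is where
      // Vectors.sparse(size, indices, Array.fill(indices.length)(1.0)) would go.
      println(s"$user,$cls -> size=$size indices=${indices.mkString("[", ",", "]")}")
    }
  }
}
```

For the sample rows this gives a vector size of 3, with indices [0,1,2] for user1, [1,2] for user2 and [0] for user3. The work per user then scales with the number of products that user has rather than with the full 1M-wide product list.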