How to encode categorical features in Apache Spark

Problem description

I have a set of data based on which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib) BUT in order to apply that function I have to create the dense vectors (with the 0s), first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?

Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example? I am using MLlib 1.2.

EDIT

I have ended up with the following piece of code, but it turns out to be really slow. Any other ideas, given that I can only use MLlib 1.2?

val data = test11.map(x=> ((x(0) , x(1)) , x(2))).groupByKey().map(x=> (x._1 , x._2.toArray)).map{x=>
  var lt : Array[Double] = new Array[Double](test12.size)
  val id = x._1._1
  val cl = x._1._2
  val dt = x._2
  var i = -1
  test12.foreach{y => i += 1; lt(i) = if(dt contains y) 1.0 else 0.0}
  val vs = Vectors.dense(lt)
  (id , cl , vs)
}

Solution

You need to upgrade to MLlib >= 1.4.0, where the simplest of several options is to use MLlib's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

Here, categoricalFields is the sequence of column indexes at which your RDD contains categorical data. Given a dataset and the indexes of its categorical columns, categories returns a structure that describes, for each field, the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

This returns your vectorized data as an RDD[Array[T]].
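
To make the two steps concrete, here is a minimal plain-Scala sketch of the same idea on an in-memory collection. The object and function names below are hypothetical and only mirror the categories/encode calls described above; they are not Spark APIs, so check the API docs of your Spark version for the actual OneHotEncoder interface. The sketch also assumes every value in a row was seen when the categories were built.

```scala
object OneHotSketch {
  // Hypothetical helper (not a Spark API): for each categorical column index,
  // map every distinct value seen in that column to a position 0..n-1.
  def categories(rows: Seq[Array[String]],
                 categoricalFields: Seq[Int]): Map[Int, Map[String, Int]] =
    categoricalFields.map { field =>
      val distinctValues = rows.map(_(field)).distinct.sorted
      field -> distinctValues.zipWithIndex.toMap
    }.toMap

  // Hypothetical helper: encode one row in sparse form, returning the total
  // vector size and the increasing indices of the entries that equal 1.0.
  // Throws NoSuchElementException if the row holds a value missing from cats.
  def encode(row: Array[String],
             cats: Map[Int, Map[String, Int]]): (Int, Array[Int]) = {
    val fields = cats.keys.toSeq.sorted
    // Running offsets: where each field's block starts in the combined vector.
    val offsets = fields.scanLeft(0)((acc, f) => acc + cats(f).size)
    val indices = fields.zip(offsets).map { case (f, off) => off + cats(f)(row(f)) }
    (offsets.last, indices.toArray)
  }
}
```

Because only the positions of the ones are kept, no dense array of zeros is ever allocated, and each value lookup goes through a map instead of scanning a collection. In MLlib, the resulting pair can be passed to Vectors.sparse(size, indices, Array.fill(indices.length)(1.0)) from org.apache.spark.mllib.linalg.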
