How to encode categorical features in Apache Spark

Views: 266 Published: 2016/5/22 15:20:00 scala apache-spark apache-spark-mllib apache-spark-1.2


Problem description

I have a set of data based on which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something MLlib already supports), but to apply that function I first have to create the dense vectors (including the 0s). In other words, I have to binarize my data. What is the easiest (or most elegant) way of doing that?

Since I am new to MLlib, could you provide a concrete example? I am using MLlib 1.2.

EDIT

I have ended up with the following piece of code, but it turns out to be really slow. Any other ideas, given that I can only use MLlib 1.2?

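import org.apache.spark.mllib.linalg.Vectors

// test11 is presumably an RDD of (user, class, product) rows; test12 the local list of all products
val data = test11
  .map(x => ((x(0), x(1)), x(2)))      // key each row by (user, class)
  .groupByKey()
  .map(x => (x._1, x._2.toArray))
  .map { x =>
    val lt: Array[Double] = new Array[Double](test12.size)
    val id = x._1._1
    val cl = x._1._2
    val dt = x._2
    var i = -1
    // For every product, linearly scan this user's products and set 1.0 or 0.0
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    val vs = Vectors.dense(lt)
    (id, cl, vs)
  }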

Solution

You need to upgrade to MLlib >= 1.4.0, where the simplest of several options is to use MLlib's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

Here categoricalFields is the sequence of column indexes at which your RDD contains categorical data. Given a dataset and the indexes of the columns that hold categorical variables, categories returns a structure that describes, for each field, the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

This returns your vectorized data as an RDD[Array[T]].
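
To make the two-step idea above concrete (first collect the values present in each categorical column, then encode rows against that map), here is a minimal plain-Scala sketch on local collections. It is not the OneHotEncoder API itself: the object name OneHotSketch and the helpers categories, width, encode and encodeGroup are hypothetical, chosen only to mirror the names used in the answer. Each row is encoded as the sorted positions of its 1.0 entries, so no dense array of zeros is ever built. On MLlib 1.2 those positions, paired with an equally long array of 1.0 values, are presumably what one would hand to Vectors.sparse(size, indices, values).

```scala
// Illustrative sketch only: OneHotSketch and its methods are hypothetical names,
// not part of Spark or MLlib.
object OneHotSketch {
  type Row = Array[String]

  /** For each categorical column index, map every distinct value to a slot 0..k-1. */
  def categories(rows: Seq[Row], categoricalFields: Seq[Int]): Map[Int, Map[String, Int]] =
    categoricalFields.map { field =>
      field -> rows.map(_(field)).distinct.sorted.zipWithIndex.toMap
    }.toMap

  /** Width of the one-hot vector: total number of distinct values over all fields. */
  def width(cats: Map[Int, Map[String, Int]]): Int =
    cats.values.map(_.size).sum

  /** Encode one row as the increasing positions of its 1.0 entries (sparse form). */
  def encode(row: Row, cats: Map[Int, Map[String, Int]]): Array[Int] = {
    var offset = 0
    val indices = Array.newBuilder[Int]
    for (field <- cats.keys.toSeq.sorted) {
      val slots = cats(field)
      // Assumption: a value that `categories` never saw is silently skipped.
      slots.get(row(field)).foreach(slot => indices += offset + slot)
      offset += slots.size
    }
    indices.result()
  }

  /** Merge several rows (e.g. all rows of one user) into one set of positions. */
  def encodeGroup(rows: Seq[Row], cats: Map[Int, Map[String, Int]]): Array[Int] =
    rows.flatMap(r => encode(r, cats)).distinct.sorted.toArray
}
```

For the sample data in the question, calling OneHotSketch.categories on the six rows with Seq(2) assigns product1, product2 and product5 the slots 0, 1 and 2, and encodeGroup over the three user1 rows yields the positions 0, 1, 2. The lookup per value is a hash-map access rather than a scan over all products, which is the main difference from the slow loop in the question. In a Spark job the category map would typically be computed once and shared with the workers (for example as a broadcast variable), which is again an assumption about the surrounding setup rather than something the answer specifies.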
