How to encode categorical features in Apache Spark


Problem description

I have a set of data from which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create sparse vectors (something MLlib already supports), BUT in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
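To make the goal concrete, here is a minimal sketch of the desired output for the sample rows above, assuming a hypothetical column ordering of product1 → 0, product2 → 1, product5 → 2:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical illustration only: one 0/1 slot per distinct product,
// in the assumed order product1, product2, product5
val user1 = Vectors.dense(1.0, 1.0, 1.0) // user1 has product1, product2, product5
val user2 = Vectors.dense(0.0, 1.0, 1.0) // user2 has product2, product5
val user3 = Vectors.dense(1.0, 0.0, 0.0) // user3 has product1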

Given that I am a newbie with regard to MLlib, could you provide a concrete example? I am using MLlib 1.2.

Edit

I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?

import org.apache.spark.mllib.linalg.Vectors

// test11: RDD of (user, class, product) rows; test12: all distinct products
val data = test11
  .map(x => ((x(0), x(1)), x(2)))   // key each row by (user, class)
  .groupByKey()
  .map { case ((id, cl), products) =>
    val dt = products.toSet
    // One dense slot per product: 1.0 if this user has it, else 0.0
    val lt = test12.map(y => if (dt(y)) 1.0 else 0.0).toArray
    (id, cl, Vectors.dense(lt))
  }
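One way to avoid the cost of materializing a 1M-slot dense array per user is to build sparse vectors directly with Vectors.sparse, which MLlib 1.2 already provides. Below is a sketch under the same assumptions as above (test11 is the raw RDD of rows, test12 holds all distinct products); productIndex is a hypothetical broadcast lookup table from product name to column index, and sc is the SparkContext:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical helper: broadcast a product -> column-index map once
val productIndex = sc.broadcast(test12.zipWithIndex.toMap)
val numProducts = test12.size

val sparseData = test11
  .map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map { case ((id, cl), products) =>
    // Store only the indices that are 1.0; every other column is implicitly 0.0
    val indices = products.map(productIndex.value).toArray.distinct.sorted
    val vs = Vectors.sparse(numProducts, indices, Array.fill(indices.length)(1.0))
    (id, cl, vs)
  }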

Answer

You can use spark.ml's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

where categoricalFields is the sequence of indexes at which your RDD contains categorical data. categories, given a dataset and the indexes of the columns that hold categorical variables, returns a structure that, for each field, describes the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

which returns your vectorized RDD[Array[T]].
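Putting the two calls together, a minimal sketch might look like the following. The field indexes and variable names are assumptions for illustration, and the OneHotEncoder here is the one this answer describes, with the two-step categories/encode API as given above (not the DataFrame-based spark.ml transformer added in later Spark releases):

import org.apache.spark.rdd.RDD

// Assumed input: rows of (user, class, product), e.g. the asker's test11
val rdd: RDD[Array[String]] = test11

// Columns 1 (class) and 2 (product) hold categorical values
val categoricalFields = Seq(1, 2)

// Step 1: scan the data to collect the distinct values of each categorical field
val categories = OneHotEncoder.categories(rdd, categoricalFields)

// Step 2: re-encode each row, expanding each categorical field into 0/1 columns
val encoded = OneHotEncoder.encode(rdd, categories)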
