How to encode categorical features in Apache Spark


Problem description

I have a set of data from which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create sparse vectors (something MLlib already supports), BUT in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
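To make the goal concrete, here is a minimal sketch of the desired output for the sample rows above, assuming a hypothetical column ordering of product1 → 0, product2 → 1, product5 → 2:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical illustration only: one 0/1 slot per distinct product,
// in the assumed order product1, product2, product5
val user1 = Vectors.dense(1.0, 1.0, 1.0) // user1 has product1, product2, product5
val user2 = Vectors.dense(0.0, 1.0, 1.0) // user2 has product2, product5
val user3 = Vectors.dense(1.0, 0.0, 0.0) // user3 has product1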

Given that I am a newbie with regard to MLlib, could you provide a concrete example? I am using MLlib 1.2.

Edit

I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?

import org.apache.spark.mllib.linalg.Vectors

// test11: RDD of (user, class, product) rows; test12: all distinct products
val data = test11
  .map(x => ((x(0), x(1)), x(2)))   // key each row by (user, class)
  .groupByKey()
  .map { case ((id, cl), products) =>
    val dt = products.toSet
    // One dense slot per product: 1.0 if this user has it, else 0.0
    val lt = test12.map(y => if (dt(y)) 1.0 else 0.0).toArray
    (id, cl, Vectors.dense(lt))
  }
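One way to avoid the cost of materializing a 1M-slot dense array per user is to build sparse vectors directly with Vectors.sparse, which MLlib 1.2 already provides. Below is a sketch under the same assumptions as above (test11 is the raw RDD of rows, test12 holds all distinct products); productIndex is a hypothetical broadcast lookup table from product name to column index, and sc is the SparkContext:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical helper: broadcast a product -> column-index map once
val productIndex = sc.broadcast(test12.zipWithIndex.toMap)
val numProducts = test12.size

val sparseData = test11
  .map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map { case ((id, cl), products) =>
    // Store only the indices that are 1.0; every other column is implicitly 0.0
    val indices = products.map(productIndex.value).toArray.distinct.sorted
    val vs = Vectors.sparse(numProducts, indices, Array.fill(indices.length)(1.0))
    (id, cl, vs)
  }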

Answer

You can use spark.ml's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

where categoricalFields is the sequence of indexes at which your RDD contains categorical data. categories, given a dataset and the indexes of the columns that hold categorical variables, returns a structure that, for each field, describes the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

which returns your vectorized RDD[Array[T]].
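Putting the two calls together, a minimal sketch might look like the following. The field indexes and variable names are assumptions for illustration, and the OneHotEncoder here is the one this answer describes, with the two-step categories/encode API as given above (not the DataFrame-based spark.ml transformer added in later Spark releases):

import org.apache.spark.rdd.RDD

// Assumed input: rows of (user, class, product), e.g. the asker's test11
val rdd: RDD[Array[String]] = test11

// Columns 1 (class) and 2 (product) hold categorical values
val categoricalFields = Seq(1, 2)

// Step 1: scan the data to collect the distinct values of each categorical field
val categories = OneHotEncoder.categories(rdd, categoricalFields)

// Step 2: re-encode each row, expanding each categorical field into 0/1 columns
val encoded = OneHotEncoder.encode(rdd, categories)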
