如何使用 spark-ml 处理分类特征? [英] How to handle categorical features with spark-ml?

查看:32
本文介绍了如何使用 spark-ml 处理分类特征?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何使用 spark-ml 而不是 spark-mllib 处理分类数据?

How do I handle categorical data with spark-ml and not spark-mllib ?

认为文档不是很清楚,似乎分类器例如RandomForestClassifierLogisticRegression,有一个featuresCol参数,指定DataFrame中特征列的名称,和一个 labelCol 参数,它指定 DataFrame 中标记类的列的名称.

Thought the documentation is not very clear, it seems that classifiers e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.

显然我想在我的预测中使用多个特征,所以我尝试使用 VectorAssembler 将我的所有特征放在 featuresCol 下的单个向量中.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

然而,VectorAssembler 只接受数字类型、布尔类型和向量类型(根据 Spark 网站),所以我不能将字符串放入我的特征向量中.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

我应该如何进行?

推荐答案

我只是想完成 Holden 的回答.

I just wanted to complete Holden's answer.

Spark 2.3.0 起,OneHotEncoder 已被弃用,并将在 3.0.0 中删除.请改用 OneHotEncoderEstimator.

Since Spark 2.3.0,OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.

Scala 中:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")

val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))

pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
// |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+

Python 中:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
# |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+

Spark 1.4.0 起,MLLib 还提供 OneHotEncoder 功能,将一列标签索引映射到一列二元向量,最多只有一个值.

Since Spark 1.4.0, MLLib also supplies OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, with at most a single one-value.

这种编码允许期望连续特征的算法,例如逻辑回归,使用分类特征

This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features

让我们考虑以下 DataFrame:

val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
            .toDF("id", "category")

第一步是使用 StringIndexer 创建索引的 DataFrame:

The first step would be to create the indexed DataFrame with the StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
                   .setInputCol("category")
                   .setOutputCol("categoryIndex")
                   .fit(df)

val indexed = indexer.transform(df)

indexed.show
// +---+--------+-------------+                                                    
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

然后您可以使用 OneHotEncodercategoryIndex 进行编码:

You can then encode the categoryIndex with OneHotEncoder :

import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
                   .setInputCol("categoryIndex")
                   .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

encoded.select("id", "categoryVec").show
// +---+-------------+
// | id|  categoryVec|
// +---+-------------+
// |  0|(2,[0],[1.0])|
// |  1|    (2,[],[])|
// |  2|(2,[1],[1.0])|
// |  3|(2,[0],[1.0])|
// |  4|(2,[0],[1.0])|
// |  5|(2,[1],[1.0])|
// +---+-------------+

这篇关于如何使用 spark-ml 处理分类特征?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆