How to handle categorical features for Decision Tree, Random Forest in spark ml?


Question

I am trying to build decision tree and random forest classifiers on the UCI bank marketing data (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data set has many categorical features (with string values).

The spark ml documentation mentions that categorical variables can be converted to numeric values by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer, because VectorIndexer requires a vector column, and VectorAssembler, which builds that vector column, accepts only numeric types. With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of a categorical feature).
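The frequency-based indexing described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the function name `frequency_index`, the sample values, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically here
    # (an assumption of this sketch) so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of a categorical column (values invented for illustration).
marital = ["married", "single", "married", "divorced", "married", "single"]

mapping = frequency_index(marital)
indexed = [mapping[v] for v in marital]

print(mapping)
print(indexed)
```

The most frequent label receives 0.0, so the numeric order of the indices reflects how often a label occurs, not any meaning of the label itself.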

My question is how the Random Forest or Decision Tree algorithm will know that the new features (derived from categorical features) are different from continuous variables. Will an indexed feature be considered continuous in the algorithm? Is this the right approach? Or should I go ahead with One-Hot-Encoding for categorical features?

I read some of the answers on this forum, but I didn't get clarity on the last part.

Recommended answer

One Hot Encoding should be done for categorical variables with more than 2 categories.

To understand why, you should know the difference between the two subcategories of categorical data: ordinal data and nominal data.

Ordinal Data: The values have some sort of ordering between them. Example: Customer Feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case StringIndexer alone is sufficient for modelling purposes, provided the assigned indices follow that ordering; as noted in the question, StringIndexer assigns indices by label frequency, so this is worth checking.

Nominal Data: The values have no defined ordering between them. Example: colours (black, blue, white, ...). In this case StringIndexer alone is NOT sufficient, and One Hot Encoding is required after String Indexing.

After String Indexing, let's assume the output is:

 id | colour   | categoryIndex
----|----------|---------------
 0  | black    | 0.0
 1  | white    | 1.0
 2  | yellow   | 2.0
 3  | red      | 3.0

Then, without One Hot Encoding, a machine learning algorithm that reads categoryIndex as a plain number will assume red > yellow > white > black, which we know is not true. OneHotEncoder() will help us avoid this situation.
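As a rough sketch of what one-hot encoding does to the indices in the table above, here is a plain-Python version. It is not the Spark API: the helper `one_hot` is hypothetical, and it keeps one dense slot per category for readability, whereas Spark's OneHotEncoder produces sparse vectors and, by default, drops the last category.

```python
def one_hot(index, num_categories):
    """Return a list with a single 1.0 at position `index`, 0.0 elsewhere."""
    vec = [0.0] * num_categories
    vec[int(index)] = 1.0
    return vec


# categoryIndex values taken from the table above.
colour_index = {"black": 0.0, "white": 1.0, "yellow": 2.0, "red": 3.0}

encoded = {
    colour: one_hot(idx, len(colour_index))
    for colour, idx in colour_index.items()
}

for colour, vec in encoded.items():
    print(colour, vec)
```

Each colour now has its own 0/1 column, so no colour is numerically "greater" than another.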

To answer your questions:

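Will indexed feature be considered as continuous in the algorithm?

It will be treated as a continuous variable by any algorithm that looks only at the numeric value. Spark ML's tree-based learners, however, use DataFrame column metadata to distinguish categorical from continuous features, and StringIndexer marks its output column as nominal, so that metadata needs to reach the features vector for the column to be treated as categorical.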

Is it the right approach? Or should I go ahead with One-Hot-Encoding for categorical features

It depends on your understanding of the data. Although Random Forest and some boosting methods don't require One Hot Encoding, most ML algorithms need it.
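One way to see why tree methods can work without One Hot Encoding is to compare the binary splits available when an index is read as a number with those available when it is read as an unordered category. The sketch below is illustrative only and is not Spark's split-finding code; the function names are hypothetical.

```python
from itertools import combinations


def threshold_splits(indices):
    """Splits of the form 'index < t': only contiguous ranges are possible."""
    ordered = sorted(indices)
    return [(set(ordered[:i]), set(ordered[i:])) for i in range(1, len(ordered))]


def subset_splits(indices):
    """Splits into any two non-empty groups, ignoring numeric order."""
    items = sorted(indices)
    first, rest = items[0], items[1:]
    splits = []
    # Keep the first item on the left side so mirror-image splits
    # are not counted twice; stop before the left side holds everything.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            splits.append((left, set(items) - left))
    return splits


# The four categoryIndex values from the colour table.
indices = [0.0, 1.0, 2.0, 3.0]
as_number = threshold_splits(indices)
as_category = subset_splits(indices)

print(len(as_number), len(as_category))
```

With four categories there are 3 threshold splits but 7 subset splits. A grouping such as {black, red} versus {white, yellow} is reachable only when the feature is handled as categorical.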

Reference: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder
