LabelEncoder for categorical features?


Question


This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinality. A lot of people use this feature by passing multiple columns at a time; however, I have some doubts about wrong ordinality in some of my features and how it will affect my model. Here is an example:

Input

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a[0])  # LabelEncoder expects a 1-D array, so pass the column

Output

array([0, 1, 1, 2], dtype=int64)

As you can see, the ordinal values are not mapped correctly, since my LabelEncoder only cares about the order in the column/array (it should be High=1, Med=2, Low=3 or vice versa). How drastically can a wrong mapping affect the model, and is there an easy way other than OrdinalEncoder() to map these values properly?

Solution

TL;DR: Using a LabelEncoder to encode ordinal features (or, really, any kind of input features) is a bad idea!


This is in fact clearly stated in the docs, where it is mentioned that, as its name suggests, this encoding method is aimed at encoding the label:

This transformer should be used to encode target values, i.e. y, and not the input X.
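
To make the intended use concrete, here's a minimal sketch (with hypothetical Pass/Fail labels) of encoding the target rather than a feature:

from sklearn.preprocessing import LabelEncoder

# Encoding the target is fine: the integer codes are just class ids,
# and classifiers don't interpret them as an ordinal scale.
le = LabelEncoder()
y = le.fit_transform(['Pass', 'Fail', 'Pass', 'Pass'])
print(y)            # [1 0 1 1]
print(le.classes_)  # ['Fail' 'Pass']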

As you rightly point out in the question, mapping the inherent ordinality of an ordinal feature to a wrong scale will have a very negative impact on the performance of the model (proportional, that is, to the relevance of the feature). And the same applies to a categorical feature, except that a categorical feature has no inherent ordinality to begin with.

An intuitive way to think about it is the way a decision tree sets its boundaries. During training, a decision tree learns the optimal feature to split on at each node, as well as an optimal threshold whereby unseen samples follow one branch or the other depending on their values.

If we encode an ordinal feature using a simple LabelEncoder, we could end up with a feature where, say, 1 represents warm, 2 translates to hot, and 0 represents boiling. In such a case, the result will be a tree with an unnecessarily high number of splits, and hence much higher complexity than needed for what should be simpler to model.

Instead, the right approach would be to use an OrdinalEncoder, and define the appropriate mapping schemes for the ordinal features. Or in the case of having a categorical feature, we should be looking at OneHotEncoder or the various encoders available in Category Encoders.
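
As a sketch of what that looks like with the Low/Medium/High example from the question (the category order passed to OrdinalEncoder is exactly the assumption we're making explicit):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Ordinal feature: state the order explicitly, so Low < Medium < High.
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
print(oe.fit_transform([['High'], ['Low'], ['Low'], ['Medium']]))
# -> [[2.], [0.], [0.], [1.]]

# Nominal feature: one-hot encoding avoids implying any order at all.
ohe = OneHotEncoder()
print(ohe.fit_transform([['red'], ['blue'], ['red']]).toarray())
# -> [[0., 1.], [1., 0.], [0., 1.]]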


Though actually seeing why this is a bad idea will be more intuitive than just words.

Let's use a simple example to illustrate the above, consisting of two ordinal features (a range with the number of hours spent by a student preparing for an exam, and the average grade of all previous assignments) and a target variable indicating whether the exam was passed or not. I've defined the dataframe's columns as pd.Categorical:

df = pd.DataFrame(
        {'Hours_of_dedication': pd.Categorical(
              values =  ['25-30', '20-25', '5-10', '5-10', '40-45', 
                         '0-5', '15-20', '20-25', '30-35', '5-10',
                         '10-15', '45-50', '20-25'],
              categories=['0-5', '5-10', '10-15', '15-20', 
                          '20-25', '25-30','30-35','40-45', '45-50']),

         'Assignments_avg_grade': pd.Categorical(
             values =  ['B', 'C', 'F', 'C', 'B', 
                        'D', 'C', 'A', 'B', 'B', 
                        'B', 'A', 'D'],
             categories=['F', 'D', 'C', 'B','A']),

         'Result': pd.Categorical(
             values = ['Pass', 'Pass', 'Fail', 'Fail', 'Pass', 
                       'Fail', 'Fail','Pass','Pass', 'Fail', 
                       'Fail', 'Pass', 'Pass'], 
             categories=['Fail', 'Pass'])
        }
    )

The advantage of defining a categorical column as a pandas Categorical is that we get to establish an order among its categories, as mentioned earlier. This allows for much faster sorting based on the established order rather than lexical sorting, and it also gives us a simple way to get the codes of the different categories according to their order.
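
A minimal sketch of both points, using a throwaway series:

s = pd.Series(pd.Categorical(
        ['Low', 'High', 'Medium'],
        categories=['Low', 'Medium', 'High'],  # establishes the order
        ordered=True))

print(s.sort_values().tolist())  # ['Low', 'Medium', 'High'], not lexical order
print(s.cat.codes.tolist())      # [0, 2, 1]: codes follow the category order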

So the dataframe we'll be using looks as follows:

print(df.head(10))

  Hours_of_dedication   Assignments_avg_grade   Result
0               20-25                       B     Pass
1               20-25                       C     Pass
2                5-10                       F     Fail
3                5-10                       C     Fail
4               40-45                       B     Pass
5                 0-5                       D     Fail
6               15-20                       C     Fail
7               20-25                       A     Pass
8               30-35                       B     Pass
9                5-10                       B     Fail

The corresponding category codes can be obtained with:

X = df.apply(lambda x: x.cat.codes)
X.head(10)

   Hours_of_dedication   Assignments_avg_grade   Result
0                    4                       3        1
1                    4                       2        1
2                    1                       0        0
3                    1                       2        0
4                    7                       3        1
5                    0                       1        0
6                    3                       2        0
7                    4                       4        1
8                    6                       3        1
9                    1                       3        0

Now let's fit a DecisionTreeClassifier and see how the tree has defined the splits:

from sklearn import tree

dt = tree.DecisionTreeClassifier()
y = X.pop('Result')  # remove the target column from X and keep it as y
dt.fit(X, y)

We can visualise the tree structure using plot_tree:

import matplotlib.pyplot as plt

t = tree.plot_tree(dt,
                   feature_names=X.columns,
                   class_names=["Fail", "Pass"],
                   filled=True,
                   label='all',
                   rounded=True)
plt.show()

The resulting tree splits just once, on the Hours_of_dedication feature. Is that all?? Well… yes! I've actually set up the features in such a way that there is this simple and obvious relation between the Hours_of_dedication feature and whether the exam is passed, making it clear that the problem should be very easy to model.


Now let's try to do the same, but directly encoding all features with an encoding scheme we could have obtained, for instance, through a LabelEncoder, thus disregarding the actual ordinality of the features and just assigning values at random:

df_wrong = df.copy()
# Note: set_categories(..., inplace=True) was removed in pandas 2.0,
# so we reassign the reordered categoricals instead:
df_wrong['Hours_of_dedication'] = df_wrong['Hours_of_dedication'].cat.set_categories(
             ['0-5', '40-45', '25-30', '10-15', '5-10', '45-50', '15-20',
              '20-25', '30-35'])
df_wrong['Assignments_avg_grade'] = df_wrong['Assignments_avg_grade'].cat.set_categories(
             ['A', 'C', 'F', 'D', 'B'])


from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 18

X_wrong = df_wrong.drop(columns=['Result']).apply(lambda x: x.cat.codes)
y = df_wrong.Result

dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)

t = tree.plot_tree(dt_wrong, 
                   feature_names = X_wrong.columns,
                   class_names=["Fail", "Pass"],
                   filled = True,
                   label='all',
                   rounded=True)

As expected, the tree structure is far more complex than necessary for the simple problem we're trying to model. In order for the tree to correctly predict all training samples, it has expanded to a depth of 4, when a single node should suffice.
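
One way to confirm this numerically, assuming both fitted classifiers from above are still in scope, is to compare their depths and leaf counts:

# The correctly encoded tree needs a single split, while the wrongly
# encoded one has expanded to depth 4 to fit the same training data:
print(dt.get_depth(), dt.get_n_leaves())              # 1 2
print(dt_wrong.get_depth(), dt_wrong.get_n_leaves())  # 4 and several leaves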

This implies that the classifier is likely to overfit, since we're drastically increasing the complexity. Nor does pruning the tree and tuning the necessary parameters to prevent overfitting solve the problem, since we've added too much noise by wrongly encoding the features.
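
For reference, such a pruning attempt might look like the sketch below (the hyperparameter values are purely illustrative); it caps the tree's complexity but cannot recover the signal destroyed by the scrambled encoding:

# Capping depth and cost-complexity pruning limit the tree's size,
# but the scrambled codes still carry no usable ordinal signal:
dt_pruned = tree.DecisionTreeClassifier(max_depth=2, ccp_alpha=0.01)
dt_pruned.fit(X_wrong, y)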

So to summarize: preserving the ordinality of the features when encoding them is crucial; otherwise, as this example makes clear, we'll lose all their predictive power and just add noise to our model.
