用于分类功能的LabelEncoder? [英] LabelEncoder for categorical features?

查看:140
本文介绍了用于分类功能的LabelEncoder?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这可能是一个初学者的问题,但是我已经看到很多人使用LabelEncoder()来将分类变量替换为常规变量.很多人一次通过传递多列来使用此功能,但是我对某些功能中的错误序数及其对模型的影响会产生疑问.这是一个示例:

输入

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a)

输出

array([0, 1, 1, 2], dtype=int64)

如您所见,序数值未正确映射,因为我的LabelEncoder仅关心列/数组中的顺序(应该为High = 1,Med = 2,Low = 3,反之亦然).错误的映射会严重影响模型,除了OrdinalEncoder()之外,还有其他简便的方法可以正确映射这些值吗?

解决方案

简短答案:使用决策树设置其边界.在训练过程中,决策树将学习要在每个节点上设置的最佳功能以及最佳阈值,根据这些阈值,看不见的样本将沿着一个分支或另一个分支前进.

如果我们使用简单的LabelEncoder对序数特征进行编码,则可能导致特征说1代表 warm 2的特征可能转换为 hot 和一个0,它可能会转换为沸腾.在这种情况下,最佳阈值最终变得毫无意义,而不再具有较高的确定性,因为样本的较高值将使样本在该路径中采取的路径最终变得毫无意义,并且我们失去了其可预测的功效.

相反,对于编码这些功能的简单方法,可以使用 DecisionTreeClassifier ,然后查看树如何定义拆分:

from sklearn import tree

dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)

我们可以使用 plot_tree :

t = tree.plot_tree(dt, 
                   feature_names = X.columns,
                   class_names=["Fail", "Pass"],
                   filled = True,
                   label='all',
                   rounded=True)

就这些吗?好吧……是!实际上,我对功能的设置方式是,奉献时数"功能与考试是否通过之间存在一种简单而明显的联系,这个问题应该很容易建模.


现在让我们尝试通过使用我们可能通过LabelEncoder获得的编码方案直接编码所有特征来进行相同的操作,因此不理会这些特征的实际顺序,而只是随机分配一个值:

df_wrong = df.copy()
df_wrong['Hours_of_dedication'].cat.set_categories(
             ['0-5','40-45', '25-30', '10-15', '5-10', '45-50','15-20', 
              '20-25','30-35'], inplace=True)
df_wrong['Assignments_avg_grade'].cat.set_categories(
             ['A', 'C', 'F', 'D', 'B'], inplace=True)


rcParams['figure.figsize'] = 14,18
X_wrong = df_wrong.drop(['Result'],1).apply(lambda x: x.cat.codes)
y = df_wrong.Result

dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)

t = tree.plot_tree(dt_wrong, 
                   feature_names = X_wrong.columns,
                   class_names=["Fail", "Pass"],
                   filled = True,
                   label='all',
                   rounded=True)

按预期,树结构比我们尝试建模的简单问题要复杂得多.为了使树正确地预测所有训练样本,它已扩展到深度4,此时单个节点就足够了.

这将暗示分类器可能过拟合,因为我们正在极大地增加其复杂性.通过修剪树并调整必要的参数以防止过度拟合,我们也无法解决问题,因为我们通过错误地编码特征而添加了太多噪声.

因此,总而言之,一旦对特征进行编码,保留特征的普遍性至关重要,否则如本例所示,我们将失去其所有可预测的功能,只需在模型中添加 noise . /p>

This might be a beginner question but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinality. A lot of people using this feature by passing multiple columns at a time, however I have some doubt about having wrong ordinality in some of my features and how it will be effecting my model. Here is an example:

Input

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a)

Output

array([0, 1, 1, 2], dtype=int64)

As you can see, the ordinal values are not mapped correctly since my LabelEncoder only cares about the order in the column/array (it should be High=1, Med=2, Low=3 or vice versa). How drastically wrong mapping can effect the models and is there an easy way other than OrdinalEncoder() to map these values properly?

解决方案

Short answer: Using a LabelEncoder to encode ordinal any kind of features is a bad idea!


This is in fact clearly stated in the docs, where it is mentioned that as its name suggests this encoding method is aimed at encoding the label:

This transformer should be used to encode target values, i.e. y, and not the input X.

As you rightly point out in the question, mapping the inherent ordinality of a feature to a wrong scale will have a very negative impact on the performance of the model (that is, proportional to the relevance of the feature).

An intuitive way to think about it, is by the way a decision tree sets its boundaries. During training, a decision tree will learn the optimal features to set at each node, as well as an optimal threshold whereby unseen samples will follow a branch or another depending on these values.

If we encode an ordinal feature using a simple LabelEncoder, that could lead to a feature having say 1 represent warm, 2 which maybe would translate to hot, and a 0 which may translate to boiling. In such case, instead of having a higher degree of certainty in which path a sample should take due to a higher value, the optimal threshold just ends up being meaningless, and we lose its predictable power.

Instead, for a simple way of encoding these features, you could go with pd.Categorical, and obtain the categorical's codes if you want a numerical representation respecting the specified order. Or if you need it to have the fit/transform methods to replicate the transformations on unseen data (more common scenario) then you should be using an OrdinalEncoder


Though actually seeing why this is a bad idea will be more intuitive than just words.

Let's use a simple example to illustrate the above, consisting on two ordinal features containing a range with the amount of hours spend by a student preparing for an exam and the average grade of all previous assignments, and a target variable indicating whether the exam was past or not. I've defined the dataframe's columns as pd.Categorical:

df = pd.DataFrame(
        {'Hours of dedication': pd.Categorical(
              values =  ['25-30', '20-25', '5-10', '5-10', '40-45', 
                         '0-5', '15-20', '20-25', '30-35', '5-10',
                         '10-15', '45-50', '20-25'],
              categories=['0-5', '5-10', '10-15', '15-20', 
                          '20-25', '25-30','30-35','40-45', '45-50']),

         'Assignments avg grade': pd.Categorical(
             values =  ['B', 'C', 'F', 'C', 'B', 
                        'D', 'C', 'A', 'B', 'B', 
                        'B', 'A', 'D'],
             categories=['F', 'D', 'C', 'B','A']),

         'Result': pd.Categorical(
             values = ['Pass', 'Pass', 'Fail', 'Fail', 'Pass', 
                       'Fail', 'Fail','Pass','Pass', 'Fail', 
                       'Fail', 'Pass', 'Pass'], 
             categories=['Fail', 'Pass'])
        }
    )

The advantage of defining a categorical column as a pandas' categorical, is that we get to establish an order among its categories, as mentioned earlier. This allows for much faster sorting based on the established order rather than lexical sorting. And it can also be used as a simple way to get codes for the different categories according to their order.

So the dataframe we'll be using looks as follows:

print(df.head())

  Hours_of_dedication   Assignments_avg_grade   Result
0               20-25                       B     Pass
1               20-25                       C     Pass
2                5-10                       F     Fail
3                5-10                       C     Fail
4               40-45                       B     Pass
5                 0-5                       D     Fail
6               15-20                       C     Fail
7               20-25                       A     Pass
8               30-35                       B     Pass
9                5-10                       B     Fail

The corresponding category codes can be obtained with:

X = df.apply(lambda x: x.cat.codes)
X.head()

   Hours_of_dedication   Assignments_avg_grade   Result
0                    4                       3        1
1                    4                       2        1
2                    1                       0        0
3                    1                       2        0
4                    7                       3        1
5                    0                       1        0
6                    3                       2        0
7                    4                       4        1
8                    6                       3        1
9                    1                       3        0

Now let's fit a DecisionTreeClassifier, and see what is how the tree has defined the splits:

from sklearn import tree

dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)

We can visualise the tree structure using plot_tree:

t = tree.plot_tree(dt, 
                   feature_names = X.columns,
                   class_names=["Fail", "Pass"],
                   filled = True,
                   label='all',
                   rounded=True)

Is that all?? Well… yes! I've actually set the features in such a way that there is this simple and obvious relation between the Hours of dedication feature, and whether the exam is passed or not, making it clear that the problem should be very easy to model.


Now let's try to do the same by directly encoding all features with an encoding scheme we could have obtained for instance through a LabelEncoder, so disregarding the actual ordinality of the features, and just assigning a value at random:

df_wrong = df.copy()
df_wrong['Hours_of_dedication'].cat.set_categories(
             ['0-5','40-45', '25-30', '10-15', '5-10', '45-50','15-20', 
              '20-25','30-35'], inplace=True)
df_wrong['Assignments_avg_grade'].cat.set_categories(
             ['A', 'C', 'F', 'D', 'B'], inplace=True)


rcParams['figure.figsize'] = 14,18
X_wrong = df_wrong.drop(['Result'],1).apply(lambda x: x.cat.codes)
y = df_wrong.Result

dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)

t = tree.plot_tree(dt_wrong, 
                   feature_names = X_wrong.columns,
                   class_names=["Fail", "Pass"],
                   filled = True,
                   label='all',
                   rounded=True)

As expected the tree structure is way more complex than necessary for the simple problem we're trying to model. In order for the tree to correctly predict all training samples it has expanded until a depth of 4, when a single node should suffice.

This will imply that the classifier is likely to overfit, since we’re drastically increasing the complexity. And by pruning the tree and tuning the necessary parameters to prevent overfitting we are not solving the problem either, since we’ve added too much noise by wrongly encoding the features.

So to summarize, preserving the ordinality of the features once encoding them is crucial, otherwise as made clear with this example we'll lose all their predictable power and just add noise to our model.

这篇关于用于分类功能的LabelEncoder?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆