OneHotEncoding Mapping


Question


To discretize categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps data alphabetically, but how does OneHotEncoder map data?


I have a pandas dataframe, dataFeat, with 5 different columns and 4 possible labels, as shown below.

dataFeat = data[['Feat1', 'Feat2', 'Feat3', 'Feat4', 'Feat5']]

Feat1  Feat2  Feat3  Feat4  Feat5
  A      B      A      A      A
  B      B      C      C      C
  D      D      A      A      B
  C      C      A      A      A

I apply the LabelEncoder like this:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

intIndexed = dataFeat.apply(le.fit_transform)


This is how the labels are encoded by the LabelEncoder

Label   LabelEncoded
 A         0
 B         1
 C         2
 D         3


I then apply a OneHotEncoder like this

enc = OneHotEncoder(sparse=False)

encModel = enc.fit(intIndexed)

dataFeatY = encModel.transform(intIndexed)

intIndexed.shape = (94, 5)
dataFeatY.shape = (94, 20)


I am a bit confused by the shape of dataFeatY - shouldn't it also be (94, 5)?


Following MhFarahani's answer below, I have done this to see how the labels are mapped:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

S = np.array(['A', 'B', 'C', 'D'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)

[0 1 2 3]

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
print(one_hot.T)

[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]


Does this mean that labels are mapped like this, or is it different for each column? (which would explain the shape being 94,20)

Label   LabelEncoded    OneHotEncoded
 A         0               1.  0.  0.  0.
 B         1               0.  1.  0.  0.
 C         2               0.  0.  1.  0.
 D         3               0.  0.  0.  1.
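The shape question can also be checked directly: OneHotEncoder fits each column separately, so 5 columns that each contain all 4 labels expand to 5 × 4 = 20 output columns. Here is a minimal sketch with a hypothetical label-encoded array (not the actual 94-row data) in which every column contains all four codes:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical (4, 5) label-encoded array: every column contains all
# four codes 0..3, like the 94-row dataFeat described above.
X = np.array([[0, 1, 2, 3, 0],
              [1, 2, 3, 0, 1],
              [2, 3, 0, 1, 2],
              [3, 0, 1, 2, 3]])

enc = OneHotEncoder()  # one-hot encodes each column separately
Y = enc.fit_transform(X).toarray()

print(Y.shape)  # (4, 20) -- 5 columns x 4 labels each
```

With 94 rows instead of 4, the same logic yields (94, 20).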

Answer


One hot encoding means that you create vectors of ones and zeros, so the order does not matter. In sklearn, you first need to encode the categorical data to numerical data and then feed it to the OneHotEncoder, for example:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

S = np.array(['b','a','c'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
print(one_hot)

which results in:

[1 0 2]

[[ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]


But pandas converts the categorical data directly:

import pandas as pd
S = pd.Series( {'A': ['b', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)

Output:

A    [b, a, c]
dtype: object

   a  b  c
0  0  1  0
1  1  0  0
2  0  0  1


As you can see from the mapping, a vector is created for each categorical feature. The elements of each vector are one at the location of the categorical feature and zero everywhere else. Here is an example where there are only two categorical features in the series:

S = pd.Series( {'A': ['a', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)

which results in:

A    [a, a, c]
dtype: object

   a  c
0  1  0
1  1  0
2  0  1

Edit to answer the new question


Let's start with this question: why do we perform one hot encoding? If you encode categorical data like ['a','b','c'] to integers [1,2,3] (e.g. with LabelEncoder), then in addition to encoding your categorical data you would give it an ordering, as 1 < 2 < 3. This way of encoding is fine for some machine learning techniques, like RandomForest. But many machine learning techniques would assume that 'a' < 'b' < 'c' if you encoded them with 1, 2, 3 respectively. To avoid this issue, you can create a column for each unique categorical value in your data. In other words, you create a new feature for each categorical value (here one column for 'a', one for 'b', and one for 'c'). The values in these new columns are set to one if the variable takes that value at that index and zero everywhere else.
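The ordering problem can be made concrete with a quick sketch (my own illustration, not part of the original answer): with integer codes, 'a' looks twice as far from 'c' as from 'b', while with one-hot vectors every pair of distinct labels is equally far apart:

```python
import numpy as np

# Integer codes impose an artificial order and distance.
codes = {'a': 0, 'b': 1, 'c': 2}
print(abs(codes['a'] - codes['c']))  # 2
print(abs(codes['a'] - codes['b']))  # 1

# One-hot vectors: every pair of distinct labels is equidistant.
one_hot = {'a': np.array([1, 0, 0]),
           'b': np.array([0, 1, 0]),
           'c': np.array([0, 0, 1])}
d_ac = np.linalg.norm(one_hot['a'] - one_hot['c'])
d_ab = np.linalg.norm(one_hot['a'] - one_hot['b'])
print(d_ac == d_ab)  # True
```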


For the array in your example, the one hot encoder would be:

features ->  A   B   C   D 

          [[ 1.  0.  0.  0.]
           [ 0.  1.  0.  0.]
           [ 0.  0.  1.  0.]
           [ 0.  0.  0.  1.]]


You have 4 categorical values: "A", "B", "C", "D". Therefore, OneHotEncoder expands your (4,) array to (4, 4) so that there is one vector (or column) for each categorical value (these will be your new features). Since "A" is the 0th element of your array, index 0 of the first column is set to 1 and the rest are set to 0. Similarly, the second vector (column) belongs to feature "B"; since "B" was at index 1 of your array, index 1 of the "B" vector is set to 1 and the rest are set to zero. The same applies to the rest of the features.


Let me change your array. Maybe it can help you better understand how the label encoder works:

S = np.array(['D', 'B', 'C', 'A'])
le = LabelEncoder()
S = le.fit_transform(S)
enc = OneHotEncoder()
encModel = enc.fit_transform(S.reshape(-1,1)).toarray()
print(encModel)


Now the result is the following. Here the first column is 'A', and since it was the last element of your array (index = 3), the last element of the first column is 1.

features ->  A   B   C   D
          [[ 0.  0.  0.  1.]
           [ 0.  1.  0.  0.]
           [ 0.  0.  1.  0.]
           [ 1.  0.  0.  0.]]


Regarding your pandas dataframe, dataFeat, you are wrong even in the first step about how LabelEncoder works. When you apply LabelEncoder, it fits each column one at a time and encodes it; then it goes to the next column and makes a new fit to that column. Here is what you should get:

from sklearn.preprocessing import LabelEncoder
df =  pd.DataFrame({'Feat1': ['A','B','D','C'],'Feat2':['B','B','D','C'],'Feat3':['A','C','A','A'],
                    'Feat4':['A','C','A','A'],'Feat5':['A','C','B','A']})
print('my data frame:')
print(df)

le = LabelEncoder()
intIndexed = df.apply(le.fit_transform)
print('Encoded data frame')
print(intIndexed)

which results in:

my data frame:
  Feat1 Feat2 Feat3 Feat4 Feat5
0     A     B     A     A     A
1     B     B     C     C     C
2     D     D     A     A     B
3     C     C     A     A     A

Encoded data frame
   Feat1  Feat2  Feat3  Feat4  Feat5
0      0      0      0      0      0
1      1      0      1      1      2
2      3      2      0      0      1
3      2      1      0      0      0


Note that in the first column, Feat1, 'A' is encoded to 0, but in the second column, Feat2, the 'B' element is 0. This happens because LabelEncoder fits each column and transforms it separately. Note that in your second column, among ('B', 'C', 'D'), the variable 'B' comes first alphabetically.
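To see this per-column behaviour in isolation, here is a small sketch fitting a LabelEncoder on the Feat2 column alone:

```python
from sklearn.preprocessing import LabelEncoder

# Feat2 from the frame above contains only 'B', 'C', 'D'; 'B' is the
# alphabetically first label present, so it receives code 0 even though
# 'A' received code 0 in Feat1.
le = LabelEncoder()
feat2_codes = le.fit_transform(['B', 'B', 'D', 'C'])

print(feat2_codes)        # [0 0 2 1]
print(list(le.classes_))  # ['B', 'C', 'D']
```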


And finally, here is what you are looking for with sklearn:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
label_encoder = LabelEncoder()
data_lable_encoded = df.apply(label_encoder.fit_transform).values
data_feature_onehot = encoder.fit_transform(data_lable_encoded).toarray()
print(data_feature_onehot)

which gives you:

[[ 1.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  1.  0.  0.]]
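The 14 output columns can be accounted for directly: each input column contributes one output column per unique label it actually contains, so 4 + 3 + 2 + 2 + 3 = 14. A quick check, rebuilding the same small frame:

```python
import pandas as pd

# Same frame as above; nunique() counts the distinct labels per column.
df = pd.DataFrame({'Feat1': ['A', 'B', 'D', 'C'],
                   'Feat2': ['B', 'B', 'D', 'C'],
                   'Feat3': ['A', 'C', 'A', 'A'],
                   'Feat4': ['A', 'C', 'A', 'A'],
                   'Feat5': ['A', 'C', 'B', 'A']})

counts = df.nunique()
print(counts.tolist())  # [4, 3, 2, 2, 3]
print(counts.sum())     # 14
```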


If you use pandas, you can compare the results, which will hopefully give you better intuition:

encoded = pd.get_dummies(df)
print(encoded)

The result:

     Feat1_A  Feat1_B  Feat1_C  Feat1_D  Feat2_B  Feat2_C  Feat2_D  Feat3_A  \
0        1        0        0        0        1        0        0        1   
1        0        1        0        0        1        0        0        0   
2        0        0        0        1        0        0        1        1   
3        0        0        1        0        0        1        0        1   

     Feat3_C  Feat4_A  Feat4_C  Feat5_A  Feat5_B  Feat5_C  
0        0        1        0        1        0        0  
1        1        0        1        0        0        1  
2        0        1        0        0        1        0  
3        0        1        0        1        0        0  

Exactly the same!
