OneHotEncoding Mapping
Question
To discretize categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps data alphabetically, but how does OneHotEncoder map data?
I have a pandas dataframe, dataFeat, with 5 different columns and 4 possible labels, as shown below.

dataFeat = data[['Feat1', 'Feat2', 'Feat3', 'Feat4', 'Feat5']]
Feat1 Feat2 Feat3 Feat4 Feat5
A B A A A
B B C C C
D D A A B
C C A A A
I apply the LabelEncoder like this:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
intIndexed = dataFeat.apply(le.fit_transform)
This is how the labels are encoded by the LabelEncoder:
Label LabelEncoded
A 0
B 1
C 2
D 3
I then apply a OneHotEncoder like this:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)
encModel = enc.fit(intIndexed)
dataFeatY = encModel.transform(intIndexed)
intIndexed.shape is (94, 5) and dataFeatY.shape is (94, 20).

I am a bit confused by the shape of dataFeatY - shouldn't it also be (94, 5)?
Following MhFarahani's answer below, I have done this to see how the labels are mapped:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

S = np.array(['A', 'B', 'C', 'D'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)
# [0 1 2 3]

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1, 1)).toarray()
print(one_hot.T)
# [[ 1.  0.  0.  0.]
#  [ 0.  1.  0.  0.]
#  [ 0.  0.  1.  0.]
#  [ 0.  0.  0.  1.]]
Does this mean that labels are mapped like this, or is it different for each column? (That would explain the shape being 94,20.)
Label LabelEncoded OneHotEncoded
A     0            1. 0. 0. 0.
B     1            0. 1. 0. 0.
C     2            0. 0. 1. 0.
D     3            0. 0. 0. 1.
Answer
One hot encoding means that you create vectors of ones and zeros, so the order does not matter. In sklearn, you first need to encode the categorical data to numerical data and then feed it to the OneHotEncoder, for example:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

S = np.array(['b', 'a', 'c'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1, 1)).toarray()
print(one_hot)
which results in:
[1 0 2]
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]
But pandas converts the categorical data directly:
import pandas as pd
S = pd.Series( {'A': ['b', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)
Output:
A [b, a, c]
dtype: object
a b c
0 0 1 0
1 1 0 0
2 0 0 1
As you can see from the mapping, a vector is created for each categorical feature. The elements of each vector are one at the location of the categorical feature and zero everywhere else. Here is an example where there are only two categorical features in the series:
S = pd.Series( {'A': ['a', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)
Result:
A [a, a, c]
dtype: object
a c
0 1 0
1 1 0
2 0 1
Edit to answer the new question
Let's start with this question: why do we perform one hot encoding? If you encode categorical data like ['a','b','c'] to integers [1,2,3] (e.g. with LabelEncoder), then in addition to encoding your categorical data you would give them weights, as 1 < 2 < 3. This way of encoding is fine for some machine learning techniques, like RandomForest. But many machine learning techniques would assume that 'a' < 'b' < 'c' if you encode them with 1, 2, 3 respectively. To avoid this issue, you can create a column for each unique categorical variable in your data. In other words, you create a new feature for each categorical variable (here one column for 'a', one for 'b', and one for 'c'). The values in these new columns are set to one if the variable was at that index and zero everywhere else.
For the array in your example, the one hot encoding would be:
features -> A B C D
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]
You have 4 categorical variables: "A", "B", "C", "D". Therefore, OneHotEncoder expands your (4,) array to (4, 4) to have one vector (or column) for each categorical variable (these will be your new features). Since "A" is the 0th element of your array, index 0 of the first column is set to 1 and the rest are set to 0. Similarly, the second vector (column) belongs to the feature "B", and since "B" was at index 1 of your array, index 1 of the "B" vector is set to 1 and the rest are set to zero. The same applies to the rest of the features.
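The expansion described above can be sketched by hand: one-hot encoding the integer labels 0..3 is just selecting rows of the 4x4 identity matrix (a minimal NumPy sketch, not how sklearn implements it internally):

```python
import numpy as np

labels = np.array([0, 1, 2, 3])  # LabelEncoder output for ['A', 'B', 'C', 'D']

# One-hot by indexing into the identity matrix: row i of eye(n)
# has a 1 at position i and 0 everywhere else.
one_hot = np.eye(labels.max() + 1)[labels]
print(one_hot)
```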
Let me change your array. Maybe it can help you better understand how the label encoder works:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

S = np.array(['D', 'B', 'C', 'A'])
le = LabelEncoder()
S = le.fit_transform(S)
enc = OneHotEncoder()
encModel = enc.fit_transform(S.reshape(-1, 1)).toarray()
print(encModel)
Now the result is the following. Here the first column is 'A', and since it was the last element of your array (index = 3), the last element of the first column is 1.
features -> A B C D
[[ 0. 0. 0. 1.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]]
Regarding your pandas dataframe, dataFeat, you are wrong even in the first step about how LabelEncoder works. When you apply LabelEncoder, it fits to one column at a time and encodes it; then it moves to the next column and makes a new fit for that column. Here is what you should get:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Feat1': ['A','B','D','C'], 'Feat2': ['B','B','D','C'], 'Feat3': ['A','C','A','A'],
                   'Feat4': ['A','C','A','A'], 'Feat5': ['A','C','B','A']})
print('my data frame:')
print(df)
le = LabelEncoder()
intIndexed = df.apply(le.fit_transform)
print('Encoded data frame')
print(intIndexed)
Result:
my data frame:
Feat1 Feat2 Feat3 Feat4 Feat5
0 A B A A A
1 B B C C C
2 D D A A B
3 C C A A A
Encoded data frame
Feat1 Feat2 Feat3 Feat4 Feat5
0 0 0 0 0 0
1 1 0 1 1 2
2 3 2 0 0 1
3 2 1 0 0 0
Note that in the first column, Feat1, 'A' is encoded to 0, but in the second column, Feat2, the element 'B' is 0. This happens because LabelEncoder fits to each column and transforms it separately. Note that in your second column, among ('B', 'C', 'D'), the variable 'B' comes first alphabetically.
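This per-column behaviour can be isolated in a couple of lines (a minimal sketch using the same letters as the Feat1 and Feat2 columns):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
feat1 = le.fit_transform(['A', 'B', 'D', 'C'])  # classes seen: A, B, C, D -> 'B' becomes 1
feat2 = le.fit_transform(['B', 'B', 'D', 'C'])  # classes seen: B, C, D    -> 'B' becomes 0
print(feat1)
print(feat2)
```

Each `fit_transform` call forgets the previous fit, which is why the same letter can map to different integers in different columns.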
And finally, here is what you are looking for with sklearn:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
label_encoder = LabelEncoder()
data_label_encoded = df.apply(label_encoder.fit_transform).to_numpy()  # .as_matrix() is deprecated
data_feature_onehot = encoder.fit_transform(data_label_encoded).toarray()
print(data_feature_onehot)
which gives you:
[[ 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0.]]
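The width of this matrix is the sum of unique labels across columns (4 + 3 + 2 + 2 + 3 = 14 for this 4-row frame; the question's 94-row data evidently summed to 20, hence the (94, 20) shape). A quick check, assuming the same toy frame as above:

```python
import pandas as pd

# Same toy frame as in the answer above.
df = pd.DataFrame({'Feat1': ['A', 'B', 'D', 'C'],
                   'Feat2': ['B', 'B', 'D', 'C'],
                   'Feat3': ['A', 'C', 'A', 'A'],
                   'Feat4': ['A', 'C', 'A', 'A'],
                   'Feat5': ['A', 'C', 'B', 'A']})

# One one-hot output column per unique label per feature:
width = int(df.nunique().sum())
print(width)
```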
If you use pandas, you can compare the results, which will hopefully give you better intuition:
encoded = pd.get_dummies(df)
print(encoded)
Result:
Feat1_A Feat1_B Feat1_C Feat1_D Feat2_B Feat2_C Feat2_D Feat3_A \
0 1 0 0 0 1 0 0 1
1 0 1 0 0 1 0 0 0
2 0 0 0 1 0 0 1 1
3 0 0 1 0 0 1 0 1
Feat3_C Feat4_A Feat4_C Feat5_A Feat5_B Feat5_C
0 0 1 0 1 0 0
1 1 0 1 0 0 1
2 0 1 0 0 1 0
3 0 1 0 1 0 0
Exactly the same!
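A side note: since scikit-learn 0.20, OneHotEncoder accepts string columns directly, so the LabelEncoder step is no longer necessary (a sketch assuming a recent scikit-learn version):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Feat1': ['A', 'B', 'D', 'C'],
                   'Feat2': ['B', 'B', 'D', 'C']})

enc = OneHotEncoder()                     # handles strings directly (sklearn >= 0.20)
one_hot = enc.fit_transform(df).toarray()
print(enc.categories_)                    # the learned label sets per column
print(one_hot.shape)
```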