Explain OneHotEncoder using Python

Question

I am new to the scikit-learn library and have been trying to use it to predict stock prices. I was going through its documentation and got stuck at the part where they explain OneHotEncoder(). Here is the code they use:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Can someone please explain, step by step, what is happening here? I have a clear idea of how one-hot encoding works, but I can't figure out how this code works. Any help is appreciated. Thanks!

Recommended answer

Let's start by writing down what you would expect (assuming you know what one-hot encoding means).

Unencoded:

f0 f1 f2
0, 0, 3
1, 1, 0
0, 2, 1
1, 0, 2

Encoded:

|f0|  |  f1 |  |   f2   |

1, 0, 1, 0, 0, 0, 0, 0, 1 
0, 1, 0, 1, 0, 1, 0, 0, 0
1, 0, 0, 0, 1, 0, 1, 0, 0
0, 1, 1, 0, 0, 0, 0, 1, 0
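
The encoded table above can be reproduced by hand. The following is a minimal plain-Python sketch of the idea, not scikit-learn's implementation; the names `rows`, `categories` and `one_hot_row` are made up for illustration.

```python
# Plain-Python sketch of one-hot encoding (illustrative only).
rows = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Sorted distinct values seen in each column (f0, f1, f2).
categories = [sorted(set(col)) for col in zip(*rows)]

def one_hot_row(row):
    """Concatenate one indicator block per feature."""
    encoded = []
    for value, cats in zip(row, categories):
        # Exactly one entry of each block is 1: the one matching `value`.
        encoded.extend(1 if value == c else 0 for c in cats)
    return encoded

encoded_rows = [one_hot_row(r) for r in rows]
for r in encoded_rows:
    print(r)
```

Each printed row should match the corresponding row of the encoded table: two entries for f0, three for f1 and four for f2.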

To get the encoding, call:

enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

With the default n_values='auto', you are specifying that the values each feature (each column of the unencoded data) can take on should be inferred from the columns of the data passed to fit.

This brings us to enc.n_values_.

From the docs:

Number of values per feature.

enc.n_values_
array([2, 3, 4])

The above means that f0 (column 1) can take on 2 values (0, 1), f1 can take on 3 values (0, 1, 2), and f2 can take on 4 values (0, 1, 2, 3).

Indeed, these are exactly the values that features f0, f1 and f2 take in the unencoded feature matrix.
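
To check those counts by hand, here is a small plain-Python sketch that collects the distinct values in each column of the training data. It only illustrates the counting and does not claim to be how the library computes n_values_; the variable names are hypothetical. In this data, every column uses every integer from 0 up to its maximum, so the number of distinct values equals the largest value plus one.

```python
# Count the distinct values per feature (column) of the training data.
rows = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

columns = list(zip(*rows))  # transpose: one tuple per feature
distinct_values = [sorted(set(col)) for col in columns]
n_values = [len(values) for values in distinct_values]

print(distinct_values)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
print(n_values)         # [2, 3, 4]
```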

Then:

enc.feature_indices_
array([0, 2, 5, 9])

From the docs:

Indices to feature ranges. Feature i in the original data is mapped to features from feature_indices_[i] to feature_indices_[i+1] (and then potentially masked by active_features_ afterwards)

These are the ranges of positions (in the encoded space) that features f0, f1 and f2 occupy:

f0: [0, 1], f1: [2, 3, 4], f2: [5, 6, 7, 8]
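
The boundaries [0, 2, 5, 9] are simply a running total of the per-feature counts [2, 3, 4], starting from 0. A short sketch of that arithmetic in plain Python (variable names are illustrative, and this is not the library's code):

```python
from itertools import accumulate

n_values = [2, 3, 4]  # number of values for f0, f1, f2

# Running total with a leading 0 gives the block boundaries.
feature_indices = [0] + list(accumulate(n_values))
print(feature_indices)  # [0, 2, 5, 9]

# Feature i occupies positions feature_indices[i] .. feature_indices[i+1] - 1.
ranges = [
    list(range(feature_indices[i], feature_indices[i + 1]))
    for i in range(len(n_values))
]
print(ranges)  # [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
```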

Mapping the vector [0, 1, 1] into the one-hot encoded space (under the mapping we got from enc.fit) gives:

1, 0, 0, 1, 0, 0, 1, 0, 0

How?

The first element, 0, belongs to f0, so it maps to position 0 (if the element were 1 instead of 0, it would map to position 1).

The next element, 1, maps to position 3 because f1 starts at position 2 and 1 is the second possible value f1 can take on.

Finally, the third element, 1, maps to position 6, since it is the second possible value f2 can take on and f2 starts at position 5.
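
Put together, the three steps amount to adding each element to the start offset of its feature. A minimal sketch, assuming (as in this example) that each feature's values are the integers 0, 1, 2, ... so the value itself can serve as the offset within its block; the variable names are hypothetical:

```python
feature_indices = [0, 2, 5, 9]  # block boundaries for f0, f1, f2
sample = [0, 1, 1]              # the vector being transformed

# Position of the "hot" entry for each feature: block start + value.
positions = [feature_indices[i] + value for i, value in enumerate(sample)]
print(positions)  # [0, 3, 6]

# Build the encoded vector: zeros everywhere except the hot positions.
encoded = [0.0] * feature_indices[-1]
for p in positions:
    encoded[p] = 1.0
print(encoded)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```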

Hopefully that clears up some stuff.
