How to reverse sklearn.OneHotEncoder transform to recover original data?
Question
I encoded my categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work and I got my predicted output back.
Is there a way to reverse the encoding and convert my output back to its original state?
Recommended answer
A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T
n_values_
Lines 1763-1786 determine the n_values_ parameter. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default, so the following lines execute:
n_samples, n_features = X.shape # 10, 2
n_values = np.max(X, axis=0) + 1 # [100, 21]
self.n_values_ = n_values
feature_indices_
Next, the feature_indices_ parameter is calculated.
n_values = np.hstack([[0], n_values]) # [0, 100, 21]
indices = np.cumsum(n_values) # [0, 100, 121]
self.feature_indices_ = indices
So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
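As a quick check, the two steps above can be reproduced with plain NumPy. This is only a sketch outside the encoder: the local names n_values and feature_indices are illustrative stand-ins for the fitted attributes, not the library's code.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# One more than the largest value seen in each column.
n_values = X.max(axis=0) + 1

# Cumulative sum with a leading 0: the starting column of each feature's block.
feature_indices = np.concatenate(([0], np.cumsum(n_values)))

print(n_values.tolist())         # [100, 21]
print(feature_indices.tolist())  # [0, 100, 121]
```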
Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.
column_indices = (X + indices[:-1]).ravel()
# array([ 3, 105, 10, 101, 15, 103, 33, 107, 54, 108, 55, 112, 78, 115, 79, 119, 80, 120, 99, 108])
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)
data = np.ones(n_samples * n_features)
# array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
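The construction can be tried outside the encoder with a standalone sketch. It assumes the same test data; the name offsets is a local stand-in for feature_indices_, and the dtype is left at SciPy's default rather than taken from the encoder.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

# Starting column of each feature's block: [0, 100, 121]
offsets = np.concatenate(([0], np.cumsum(X.max(axis=0) + 1)))

# Shift each value by its feature's offset to get its one-hot column.
column_indices = (X + offsets[:-1]).ravel()
row_indices = np.repeat(np.arange(n_samples), n_features)
data = np.ones(n_samples * n_features)

# Build in COO format, then convert to CSR.
out = sparse.coo_matrix(
    (data, (row_indices, column_indices)),
    shape=(n_samples, int(offsets[-1])),
).tocsr()

print(out.shape)  # (10, 121)
print(out.nnz)    # 20
```

Each row holds exactly one stored 1 per original feature, so the 10 rows with 2 features give 20 stored elements.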
Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.
if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0] # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features] # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features
return out if self.sparse else out.toarray()
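The column compression can be illustrated with a small sketch. For brevity it uses a dense NumPy array in place of the sparse matrix, which is an assumption of this example and not what the encoder does internally; one_hot and compressed are illustrative names.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape
offsets = np.array([0, 100, 121])  # 0 followed by the cumulative sum of [100, 21]

# Dense stand-in for the 10x121 one-hot matrix.
one_hot = np.zeros((n_samples, offsets[-1]))
rows = np.repeat(np.arange(n_samples), n_features)
one_hot[rows, (X + offsets[:-1]).ravel()] = 1.0

# Keep only the columns that are non-zero somewhere.
mask = one_hot.sum(axis=0) != 0
active_features = np.flatnonzero(mask)
compressed = one_hot[:, active_features]

print(active_features.tolist())
print(compressed.shape)  # (10, 19)
```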
Decoding
Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned, along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.
from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder() # all default params
out = ohc.fit_transform(X)
The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.
out.indices # array([12, 0, 10, 1, 11, 2, 13, 3, 14, 4, 15, 5, 16, 6, 17, 7, 18, 8, 14, 9], dtype=int32)
out = out.sorted_indices()
out.indices # array([ 0, 12, 1, 10, 2, 11, 3, 13, 4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 14], dtype=int32)
We can see that before sorting, the indices are actually reversed within each row. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. The 0 corresponds to the 3 in the first column of X: since 3 is the minimum element of that column, it was assigned to the first active column. The 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
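The effect of sorted_indices can be seen in isolation with a small sketch. It hand-builds a CSR matrix holding only the first two encoded rows, with their column indices deliberately stored out of order; the name m is illustrative.

```python
import numpy as np
from scipy import sparse

# First two encoded rows, with column indices stored last-column-first.
data = np.ones(4)
indices = np.array([12, 0, 10, 1], dtype=np.int32)
indptr = np.array([0, 2, 4], dtype=np.int32)
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 19))

# sorted_indices() returns a copy whose column indices are sorted within each row.
sorted_cols = m.sorted_indices().indices
print(sorted_cols.tolist())  # [0, 12, 1, 10]
```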
Next, let's look at active_features_:
ohc.active_features_ # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features that were in the first column of X are unchanged, and the features in the second column have simply been offset by 100, which corresponds to ohc.feature_indices_[1].
Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:
import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)
This gives us:
array([[ 3, 105],
[ 10, 101],
[ 15, 103],
[ 33, 107],
[ 54, 108],
[ 55, 112],
[ 78, 115],
[ 79, 119],
[ 80, 120],
[ 99, 108]])
And we can get back to the original feature values by subtracting the offsets from ohc.feature_indices_:
recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3, 5],
[10, 1],
[15, 3],
[33, 7],
[54, 8],
[55, 12],
[78, 15],
[79, 19],
[80, 20],
[99, 8]])
Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
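To check the whole round trip without depending on a particular scikit-learn version, here is a minimal sketch that re-implements the encoding steps walked through above with NumPy and SciPy and then decodes the result. The functions encode and decode are hypothetical helpers written for this illustration; they are not part of scikit-learn.

```python
import numpy as np
from scipy import sparse


def encode(X):
    """One-hot encode non-negative integer features (illustrative helper)."""
    n_samples, n_features = X.shape
    feature_indices = np.concatenate(([0], np.cumsum(X.max(axis=0) + 1)))
    cols = (X + feature_indices[:-1]).ravel()
    rows = np.repeat(np.arange(n_samples), n_features)
    out = sparse.coo_matrix(
        (np.ones(n_samples * n_features), (rows, cols)),
        shape=(n_samples, int(feature_indices[-1])),
    ).tocsr()
    # Drop columns that are never set.
    active_features = np.flatnonzero(np.asarray(out.sum(axis=0)).ravel() != 0)
    return out[:, active_features], active_features, feature_indices


def decode(out, active_features, feature_indices, shape):
    """Invert encode(); needs the original (n_samples, n_features) shape."""
    cols = out.sorted_indices().indices       # compressed column numbers, row by row
    decoded = active_features[cols].reshape(shape)
    return decoded - feature_indices[:-1]     # remove each feature's offset


X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

out, active_features, feature_indices = encode(X)
recovered_X = decode(out, active_features, feature_indices, X.shape)
print(np.array_equal(recovered_X, X))  # True
```

Indexing active_features with the whole indices array does the same lookup as the np.vectorize version, without a Python-level loop.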
Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
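The attributes used above (n_values_, feature_indices_, active_features_) belong to the older OneHotEncoder implementation this answer walks through. In scikit-learn 0.20 and later, OneHotEncoder provides an inverse_transform method, so if you are on a recent release a sketch like the following may be all you need. The variable name enc is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

enc = OneHotEncoder()  # default parameters; returns a sparse matrix
out = enc.fit_transform(X)

# Map the one-hot columns back to the original category values.
recovered_X = enc.inverse_transform(out)
print(np.array_equal(recovered_X, X))  # True
```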