How to reverse sklearn.OneHotEncoder transform to recover original data?

Question

I encoded my categorical data using sklearn.OneHotEncoder and fed them to a random forest classifier. Everything seems to work and I got my predicted output back.

Is there a way to reverse the encoding and convert my output back to its original state?

Recommended answer

A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.

import numpy as np

# Two integer-coded categorical features, ten samples each
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

n_values_

Lines 1763-1786 of that source determine the n_values_ attribute. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default, so the following lines execute:

n_samples, n_features = X.shape    # 10, 2
n_values = np.max(X, axis=0) + 1   # [100, 21]
self.n_values_ = n_values

feature_indices_

Next, the feature_indices_ attribute is calculated.

n_values = np.hstack([[0], n_values])  # [0, 100, 21]
indices = np.cumsum(n_values)          # [0, 100, 121]
self.feature_indices_ = indices

So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
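
The two steps above can be reproduced with NumPy alone. This is only a minimal sketch of the default n_values='auto' path, not the library code itself; the local variable names are chosen here for illustration.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# One slot per possible value 0..max in each feature column.
n_values = np.max(X, axis=0) + 1

# Starting column offset of each feature block, followed by the total width.
feature_indices = np.cumsum(np.hstack([[0], n_values]))

print(n_values)
print(feature_indices)
```

The last entry of feature_indices is the width of the uncompressed one-hot matrix, and the entries before it are the per-feature offsets used later for decoding.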

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.

column_indices = (X + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,  55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
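
To try this construction outside the encoder class, something like the following should work. It is a standalone sketch that assumes SciPy is installed; the offsets array simply hard-codes feature_indices_[:-1] from the walkthrough, and the dtype argument is left at its default.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape
offsets = np.array([0, 100])  # feature_indices_[:-1] from the text
total_width = 121             # feature_indices_[-1] from the text

# Shift each feature's values into its own block of columns.
column_indices = (X + offsets).ravel()
# Each sample index is repeated once per feature.
row_indices = np.repeat(np.arange(n_samples), n_features)
data = np.ones(n_samples * n_features)

out = sparse.coo_matrix(
    (data, (row_indices, column_indices)),
    shape=(n_samples, total_width),
).tocsr()

print(out.shape, out.nnz)
```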

Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."

Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.

if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]  # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()
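
The column-compression step can be illustrated on a dense array, which makes the mask easier to inspect. This is a sketch for illustration only: the encoder itself works on the sparse matrix, and the dense stand-in and its names are assumptions made here.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
offsets = np.array([0, 100])  # feature_indices_[:-1] from the text

# Dense stand-in for the 10x121 one-hot matrix.
dense = np.zeros((X.shape[0], 121))
rows = np.arange(X.shape[0])[:, None]  # shape (10, 1), broadcasts against (10, 2)
dense[rows, X + offsets] = 1.0

# Keep only the columns that are used by at least one sample.
mask = dense.sum(axis=0) != 0
active_features = np.where(mask)[0]
compressed = dense[:, active_features]

print(active_features)
print(compressed.shape)
```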

Decoding

Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned, along with the OneHotEncoder attributes detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.

from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder()  # all default params
out = ohc.fit_transform(X)

The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.

out.indices  # array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4, 15,  5, 16,  6, 17,  7, 18, 8, 14,  9], dtype=int32)
out = out.sorted_indices()
out.indices  # array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,  5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

We can see that before sorting, the indices within each row are actually reversed: the entry for the last column of X comes first and the entry for the first column comes last. This is evident from the first two elements: [12, 0]. The 0 corresponds to the 3 in the first column of X; since 3 is the smallest value in that column, it was assigned to the first active column. The 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct encoded columns, the smallest element of the second column (1) gets index 10, the next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
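
The effect of sorted_indices can be seen on a tiny hand-built matrix. The sketch below assumes SciPy is installed and builds only the first two rows of the encoded output, with the column indices deliberately stored out of order.

```python
import numpy as np
from scipy import sparse

# First two rows of the encoded matrix: columns [12, 0] and [10, 1].
data = np.ones(4)
indices = np.array([12, 0, 10, 1])
indptr = np.array([0, 2, 4])
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 19))

# sorted_indices returns a copy whose column indices are ascending within each row.
sorted_m = m.sorted_indices()

print(sorted_m.indices)
```

Both matrices represent the same data; only the storage order of the column indices within each row differs.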

Next, we take a look at active_features_:

ohc.active_features_  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])

Notice that there are 19 elements, which corresponds to the number of distinct (column, value) pairs in our data (one value, 8, appears twice in the second column). Notice also that these are arranged in ascending order. The features from the first column of X are unchanged, and the features from the second column have simply been offset by 100, which corresponds to ohc.feature_indices_[1].

Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the column numbers in out.indices are positions in ohc.active_features_. With this, we can decode:

import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)

This gives us:

array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])

And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:

recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3,  5],
       [10,  1],
       [15,  3],
       [33,  7],
       [54,  8],
       [55, 12],
       [78, 15],
       [79, 19],
       [80, 20],
       [99,  8]])
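
The same decoding can be checked without a fitted encoder by hard-coding the arrays shown in this walkthrough. This is a sketch under that assumption; it also uses NumPy fancy indexing in place of np.vectorize, which is a substitution made here rather than part of the original approach.

```python
import numpy as np

# Values copied from the walkthrough (stand-ins for the encoder attributes).
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])
feature_indices = np.array([0, 100, 121])
sorted_cols = np.array([0, 12, 1, 10, 2, 11, 3, 13, 4, 14,
                        5, 15, 6, 16, 7, 17, 8, 18, 9, 14])
n_samples, n_features = 10, 2

# Map compressed column numbers back to uncompressed column numbers.
decoded = active_features[sorted_cols].reshape(n_samples, n_features)
# Remove each feature's block offset to get the original values.
recovered_X = decoded - feature_indices[:-1]

print(recovered_X)
```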

Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
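
If the encoder returned a dense array instead (sparse=False), the column numbers can probably be read off with np.nonzero, which scans in row-major order and so yields them already sorted within each row. The sketch below builds a stand-in for that dense output by hand; the arrays are copied from the walkthrough and the construction is an assumption for illustration, not encoder code.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])
offsets = np.array([0, 100])  # feature_indices_[:-1] from the text
n_features = X.shape[1]

# Stand-in for the dense encoder output: one 1.0 per (row, feature).
cols = np.searchsorted(active_features, X + offsets)
dense = np.zeros((X.shape[0], active_features.size))
dense[np.arange(X.shape[0])[:, None], cols] = 1.0

# Column index of every nonzero entry, in row-major order.
_, col_idx = np.nonzero(dense)
# Each row has exactly one active column per feature, so n_features fixes the shape.
recovered_X = active_features[col_idx].reshape(-1, n_features) - offsets

print(recovered_X)
```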

Given the sklearn.OneHotEncoder instance ohc, the encoded data out (a scipy.sparse.csr_matrix returned by ohc.fit_transform or ohc.transform), and the shape of the original data (n_samples, n_features), recover the original data X with:

# Parentheses allow the expression to span several lines.
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
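
Note that the attributes used above (n_values, active_features_, feature_indices_) belong to the older encoder implementation. As far as I know, scikit-learn 0.20 and later provide OneHotEncoder.inverse_transform, which does the decoding directly. A minimal sketch, assuming a recent scikit-learn (0.22 or later) is installed:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

enc = OneHotEncoder()           # all default params
encoded = enc.fit_transform(X)  # sparse matrix by default

# inverse_transform maps each one-hot row back to its original category values.
recovered_X = enc.inverse_transform(encoded)

print(recovered_X)
```

In that API the learned categories are stored per feature in enc.categories_, so no manual offset arithmetic is needed.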
