将2d矩阵转换为3d一个热矩阵numpy [英] Convert a 2d matrix to a 3d one hot matrix numpy

查看:75
本文介绍了将2d矩阵转换为3d一个热矩阵numpy的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有np矩阵,我想将其热编码为第三维以将其转换为3d数组.有没有一种方法可以不循环遍历每一行 例如

I have np matrix and I want to convert it to a 3d array with one hot encoding of the elements as third dimension. Is there a way to do with without looping over each row eg

a=[[1,3],
   [2,4]]

应制成

b=[[1,0,0,0], [0,0,1,0],
   [0,1,0,0], [0,0,0,1]]

推荐答案

方法1

这是一种厚脸皮的单线,滥用了 比较-

(np.arange(a.max()) == a[...,None]-1).astype(int)

样品运行-

In [120]: a
Out[120]: 
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])

In [121]: (np.arange(a.max()) == a[...,None]-1).astype(int)
Out[121]: 
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])

对于0-based索引,应为-

In [122]: (np.arange(a.max()+1) == a[...,None]).astype(int)
Out[122]: 
array([[[0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0]],

       [[0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0]]])

如果一键式编码要覆盖从最小值到最大值的范围内的值,则将其偏移最小值,然后将其馈送到建议的0-based索引方法中.这也将适用于本文后面稍后讨论的其余方法.

If the one-hot enconding is to cover for the range of values ranging from the minimum to the maximum values, then offset by the minimum value and then feed it to the proposed method for 0-based indexing. This would be applicable for rest of the approaches discussed later on in this post as well.

以下是在同一样本上运行的示例-

Here's a sample run on the same -

In [223]: a
Out[223]: 
array([[ 6, 12, 10,  8],
       [ 7,  9,  6,  9]])

In [224]: a_off = a - a.min() # feed a_off to proposed approaches

In [225]: (np.arange(a_off.max()+1) == a_off[...,None]).astype(int)
Out[225]: 
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])

如果您可以使用布尔数组,对于1's使用True且对于0's使用False,则可以跳过.astype(int)转换.

If you are okay with a boolean array with True for 1's and False for 0's, you can skip the .astype(int) conversion.

我们还可以使用

We can also initialize a zeros arrays and index into the output with advanced-indexing. Thus, for 0-based indexing, we would have -

def onehot_initialization(a):
    ncols = a.max()+1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    out[all_idx(a, axis=2)] = 1
    return out

Helper func-

Helper func -

# https://stackoverflow.com/a/46103129/ @Divakar
def all_idx(idx, axis):
    grid = np.ogrid[tuple(map(slice, idx.shape))]
    grid.insert(axis, idx)
    return tuple(grid)

当处理更大范围的值时,这应该表现得更好.

This should be especially more performant when dealing with larger range of values.

对于1-based索引,只需输入a-1作为输入.

For 1-based indexing, simply feed in a-1 as the input.

现在,如果您正在寻找稀疏数组作为输出和AFAIK,因为scipy的内置稀疏矩阵仅支持2D格式,则可以得到稀疏输出,该输出是前面显示的输出的重塑版本,前两个轴合并第三轴保持不变. 0-based索引的实现看起来像这样-

Now, if you are looking for sparse array as output and AFAIK since scipy's inbuilt sparse matrices support only 2D formats, you can get a sparse output that is a reshaped version of the output shown earlier with the first two axes merging and the third axis being kept intact. The implementation for 0-based indexing would look something like this -

from scipy.sparse import coo_matrix
def onehot_sparse(a):
    N = a.size
    L = a.max()+1
    data = np.ones(N,dtype=int)
    return coo_matrix((data,(np.arange(N),a.ravel())), shape=(N,L))

同样,对于1-based索引,只需输入a-1作为输入.

Again, for 1-based indexing, simply feed in a-1 as the input.

样品运行-

In [157]: a
Out[157]: 
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])

In [158]: onehot_sparse(a).toarray()
Out[158]: 
array([[0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0]])

In [159]: onehot_sparse(a-1).toarray()
Out[159]: 
array([[1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0]])

如果您可以使用稀疏输出,那么这将比前两种方法好得多.

This would be much better than previous two approaches if you are okay with having sparse output.

基于0的索引的运行时比较

案例1:

In [160]: a = np.random.randint(0,100,(100,100))

In [161]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
1000 loops, best of 3: 1.51 ms per loop

In [162]: %timeit onehot_initialization(a)
1000 loops, best of 3: 478 µs per loop

In [163]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.5 µs per loop

In [164]: %timeit onehot_sparse(a).toarray()
1000 loops, best of 3: 530 µs per loop

案例2:

In [166]: a = np.random.randint(0,500,(100,100))

In [167]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
100 loops, best of 3: 8.51 ms per loop

In [168]: %timeit onehot_initialization(a)
100 loops, best of 3: 2.52 ms per loop

In [169]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.1 µs per loop

In [170]: %timeit onehot_sparse(a).toarray()
100 loops, best of 3: 2.67 ms per loop

挤出最佳性能

为了获得最佳性能,我们可以修改方法2,以在2D形状的输出数组上使用索引,还可以使用uint8 dtype来提高内存效率,并加快分配速度,就像这样-

Squeezing out best performance

To squeeze out the best performance, we could modify approach #2 to use indexing on a 2D shaped output array and also use uint8 dtype for memory efficiency and that leading to much faster assignments, like so -

def onehot_initialization_v2(a):
    ncols = a.max()+1
    out = np.zeros( (a.size,ncols), dtype=np.uint8)
    out[np.arange(a.size),a.ravel()] = 1
    out.shape = a.shape + (ncols,)
    return out

时间-

In [178]: a = np.random.randint(0,100,(100,100))

In [179]: %timeit onehot_initialization(a)
     ...: %timeit onehot_initialization_v2(a)
     ...: 
1000 loops, best of 3: 474 µs per loop
10000 loops, best of 3: 128 µs per loop

In [180]: a = np.random.randint(0,500,(100,100))

In [181]: %timeit onehot_initialization(a)
     ...: %timeit onehot_initialization_v2(a)
     ...: 
100 loops, best of 3: 2.38 ms per loop
1000 loops, best of 3: 213 µs per loop

这篇关于将2d矩阵转换为3d一个热矩阵numpy的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆