Convert a 2D matrix to a 3D one-hot matrix in NumPy
Problem description
I have a NumPy matrix and I want to convert it to a 3D array, with a one-hot encoding of the elements as the third dimension. Is there a way to do this without looping over each row? For example,
a=[[1,3],
   [2,4]]
should be converted to
b=[[[1,0,0,0], [0,0,1,0]],
   [[0,1,0,0], [0,0,0,1]]]
Recommended answer
Approach #1
(np.arange(a.max()) == a[...,None]-1).astype(int)
Sample run -
In [120]: a
Out[120]:
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])
In [121]: (np.arange(a.max()) == a[...,None]-1).astype(int)
Out[121]:
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],
       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])
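To see why the comparison above produces a 3D result, it may help to look at the shapes involved. The following is a minimal sketch of the same expression split into steps; the intermediate names `classes` and `expanded` are illustrative only and not part of the original answer.

```python
import numpy as np

a = np.array([[1, 7, 5, 3],
              [2, 4, 1, 4]])

# Candidate class indices 0..max-1, shape (7,)
classes = np.arange(a.max())

# Add a trailing axis and shift the 1-based values to 0-based, shape (2, 4, 1)
expanded = a[..., None] - 1

# Broadcasting (7,) against (2, 4, 1) yields a (2, 4, 7) boolean array,
# which is then cast to integers
onehot = (classes == expanded).astype(int)

print(classes.shape, expanded.shape, onehot.shape)
```

Each position along the last axis is compared with the element's value, so exactly one entry per element ends up set.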
For 0-based indexing, it would be -
In [122]: (np.arange(a.max()+1) == a[...,None]).astype(int)
Out[122]:
array([[[0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0]],
       [[0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0]]])
If the one-hot encoding is to cover only the range from the minimum to the maximum value, subtract the minimum from the input and then feed the result to the method proposed for 0-based indexing. The same applies to the rest of the approaches discussed later in this post.
Here's a sample run on the same -
In [223]: a
Out[223]:
array([[ 6, 12, 10,  8],
       [ 7,  9,  6,  9]])
In [224]: a_off = a - a.min() # feed a_off to proposed approaches
In [225]: (np.arange(a_off.max()+1) == a_off[...,None]).astype(int)
Out[225]:
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],
       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])
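The offsetting step can be wrapped in a small helper. This is only a sketch: the function name `onehot_min_to_max` is hypothetical and not from the original answer, and it assumes an integer input array. It also shows how the original values might be recovered by adding the minimum back.

```python
import numpy as np

def onehot_min_to_max(a):
    # Hypothetical helper: shift values so the minimum maps to index 0,
    # then apply the 0-based broadcasting comparison
    a_off = a - a.min()
    return (np.arange(a_off.max() + 1) == a_off[..., None]).astype(int)

a = np.array([[6, 12, 10, 8],
              [7, 9, 6, 9]])

out = onehot_min_to_max(a)

# Decode: the position of the 1 along the last axis, plus the minimum
decoded = out.argmax(axis=-1) + a.min()

print(out.shape)
print(decoded)
```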
If you are okay with a boolean array, with True for 1's and False for 0's, you can skip the .astype(int) conversion.
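As a brief illustration of the boolean variant, the sketch below skips the cast and uses the mask directly. The variable names are illustrative, and 0-based indexing is assumed.

```python
import numpy as np

a = np.array([[1, 3],
              [2, 4]])

# 0-based one-hot encoding kept as a boolean mask (no .astype(int))
mask = np.arange(a.max() + 1) == a[..., None]

print(mask.dtype)
print(mask.shape)

# The mask still supports the usual reductions, e.g. recovering the values
recovered = mask.argmax(axis=-1)
print(recovered)
```

A boolean array takes one byte per entry, so it is also smaller than the default integer output.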
Approach #2
We can also initialize an array of zeros and index into the output with advanced indexing. Thus, for 0-based indexing, we would have -
def onehot_initialization(a):
    ncols = a.max()+1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    out[all_idx(a, axis=2)] = 1
    return out
Helper func -
# https://stackoverflow.com/a/46103129/ @Divakar
def all_idx(idx, axis):
    # list(...) so that insert() works whether np.ogrid returns a list or a tuple
    grid = list(np.ogrid[tuple(map(slice, idx.shape))])
    grid.insert(axis, idx)
    return tuple(grid)
This should perform better than approach #1, especially when dealing with a larger range of values.
For 1-based indexing, simply feed in a-1 as the input.
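For a 2D input, the index tuple that all_idx builds amounts to a row index, a column index, and the array itself as the index along the last axis. The sketch below spells that out by hand, assuming 0-based values; the names `rows` and `cols` are illustrative only.

```python
import numpy as np

a = np.array([[1, 3],
              [2, 0]])
m, n = a.shape

rows = np.arange(m)[:, None]   # shape (m, 1)
cols = np.arange(n)[None, :]   # shape (1, n)

out = np.zeros((m, n, a.max() + 1), dtype=int)

# rows, cols and a broadcast to shape (m, n); for each (i, j) this sets
# out[i, j, a[i, j]] = 1
out[rows, cols, a] = 1

# Cross-check against the broadcasting comparison from approach #1
ref = (np.arange(a.max() + 1) == a[..., None]).astype(int)
print(np.array_equal(out, ref))
```

Only one entry per element is written, rather than comparing every element against every class, which is a plausible reason for the speedup with a larger range of values.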
Approach #3
Now, if you are looking for a sparse array as output: AFAIK scipy's built-in sparse matrices support only 2D formats, so you can get a sparse output that is a reshaped version of the output shown earlier, with the first two axes merged and the third axis kept intact. The implementation for 0-based indexing would look something like this -
from scipy.sparse import coo_matrix
def onehot_sparse(a):
    N = a.size
    L = a.max()+1
    data = np.ones(N,dtype=int)
    return coo_matrix((data,(np.arange(N),a.ravel())), shape=(N,L))
Again, for 1-based indexing, simply feed in a-1 as the input.
Sample run -
In [157]: a
Out[157]:
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])
In [158]: onehot_sparse(a).toarray()
Out[158]:
array([[0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0]])
In [159]: onehot_sparse(a-1).toarray()
Out[159]:
array([[1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0]])
If you are okay with having sparse output, this would be much faster than the previous two approaches.
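If the dense 3D layout is needed at some later point, the 2D sparse result can be densified and reshaped back. The following is a sketch under the assumption of 0-based values; `onehot_sparse` is repeated from above so the block is self-contained, and `dense3d` is an illustrative name.

```python
import numpy as np
from scipy.sparse import coo_matrix

def onehot_sparse(a):
    N = a.size
    L = a.max() + 1
    data = np.ones(N, dtype=int)
    return coo_matrix((data, (np.arange(N), a.ravel())), shape=(N, L))

a = np.array([[1, 7, 5, 3],
              [2, 4, 1, 4]])

sp = onehot_sparse(a)            # shape (a.size, a.max()+1)

# Split the merged first axis back into the original two axes
dense3d = sp.toarray().reshape(a.shape + (sp.shape[1],))

print(sp.shape, dense3d.shape)
```

Note that densifying gives up the memory advantage, so this is mainly useful for checking results or for a final conversion step.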
Runtime comparison for 0-based indexing
Case #1 :
In [160]: a = np.random.randint(0,100,(100,100))
In [161]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
1000 loops, best of 3: 1.51 ms per loop
In [162]: %timeit onehot_initialization(a)
1000 loops, best of 3: 478 µs per loop
In [163]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.5 µs per loop
In [164]: %timeit onehot_sparse(a).toarray()
1000 loops, best of 3: 530 µs per loop
Case #2 :
In [166]: a = np.random.randint(0,500,(100,100))
In [167]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
100 loops, best of 3: 8.51 ms per loop
In [168]: %timeit onehot_initialization(a)
100 loops, best of 3: 2.52 ms per loop
In [169]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.1 µs per loop
In [170]: %timeit onehot_sparse(a).toarray()
100 loops, best of 3: 2.67 ms per loop
Squeezing out best performance
To squeeze out the best performance, we could modify approach #2 to index into a 2D output array and also use the uint8 dtype for memory efficiency, which leads to much faster assignments, like so -
def onehot_initialization_v2(a):
    ncols = a.max()+1
    out = np.zeros( (a.size,ncols), dtype=np.uint8)
    out[np.arange(a.size),a.ravel()] = 1
    out.shape = a.shape + (ncols,)
    return out
Timings -
In [178]: a = np.random.randint(0,100,(100,100))
In [179]: %timeit onehot_initialization(a)
...: %timeit onehot_initialization_v2(a)
...:
1000 loops, best of 3: 474 µs per loop
10000 loops, best of 3: 128 µs per loop
In [180]: a = np.random.randint(0,500,(100,100))
In [181]: %timeit onehot_initialization(a)
...: %timeit onehot_initialization_v2(a)
...:
100 loops, best of 3: 2.38 ms per loop
1000 loops, best of 3: 213 µs per loop
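As a sanity check on the faster variant, the sketch below compares its output with the broadcasting approach and prints the memory footprint of both. It is a variant rather than a verbatim copy: it returns `out.reshape(...)` instead of assigning to `out.shape`, which is assumed to be equivalent here, and the random input is illustrative.

```python
import numpy as np

def onehot_initialization_v2(a):
    ncols = a.max() + 1
    out = np.zeros((a.size, ncols), dtype=np.uint8)
    out[np.arange(a.size), a.ravel()] = 1
    # reshape() is used here in place of assigning to out.shape (assumed equivalent)
    return out.reshape(a.shape + (ncols,))

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(100, 100))

ref = (np.arange(a.max() + 1) == a[..., None]).astype(int)
out = onehot_initialization_v2(a)

print(np.array_equal(out, ref))   # same one-hot pattern
print(out.nbytes, ref.nbytes)     # uint8 output vs default integer output
```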