Inserting null columns into a scipy sparse matrix in a specific order
Question
I have a sparse matrix with M rows and N columns, to which I want to add K NULL columns, so that the result has M rows and (N+K) columns. The tricky part is that I also have a list of N indices, each between 0 and N+K-1, that gives the position every original column should have in the new matrix.
So for example, if N = 2, K = 1 and the list of indices is [2, 0], I want the last column of my MxN matrix to become the first one, then a null column, and then my first column as the last one.
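To make the target layout concrete, here is a tiny dense illustration of that example. It is only a sketch: M = 3 and the values in x are made up, and dense assignment is used solely because the matrix is toy-sized.

```python
import numpy as np

# Toy M x N input with M = 3, N = 2 (illustrative values only)
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
N, K = x.shape[1], 1
indices = np.array([2, 0])  # indices[i] = position of old column i in the result

expected = np.zeros((x.shape[0], N + K), dtype=x.dtype)
expected[:, indices] = x  # dense fancy assignment; acceptable at toy size
print(expected)
# [[2 0 1]
#  [4 0 3]
#  [6 0 5]]
```

The middle column is the null column, because position 1 does not appear in the index list.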
I'm trying to use the following code. In practice I already have x, but I can't upload it here.
import numpy as np
from scipy import sparse
M = 5000
N = 10
pad_factor = 1.2
size = int(pad_factor * N)
x = sparse.random(m = M, n = N, density = 0.1, dtype = 'float64')
indeces = np.random.choice(range(size), size=N, replace=False)
null_mat = sparse.lil_matrix((M, size))
null_mat[:, indeces] = x
The problem is that for N = 1,500,000, M = 5,000 and K = 200 this code does not scale and fails with a memory error. The exact error is: "return np.zeros(self.shape, dtype = self.dtype, order=order) MemoryError".
I just want to add some null columns, so I guess my slicing idea is inefficient, especially since K << N in my real data. In a way we can think about this as a merge sort problem: I have a non-null and a null dataset and I want to concatenate them, but in a specific order. Any ideas on how to make it work?
Thanks!
Recommended answer
As I deduced in the comments, the memory error was produced in the line
null_mat[:, indeces] = x
because the lil __setitem__ method calls x.toarray(); that is, it first converts x to a dense array. Mapping the sparse matrix onto the indexed lil directly might be more space efficient, but it would take a lot more work to code. Also, lil is optimized for iterative assignment, not for this kind of large-scale matrix mapping.
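A quick back-of-the-envelope check shows why that dense intermediate cannot fit in memory. This sketch takes 5,000 rows and 1,500,000 columns from the question and assumes float64 data, as in the sample code; nothing is actually allocated.

```python
import numpy as np

M, N = 5_000, 1_500_000                  # sizes quoted in the question
itemsize = np.dtype("float64").itemsize  # 8 bytes per element

dense_bytes = M * N * itemsize           # size of the array x.toarray() would need
print(f"{dense_bytes / 1e9:.0f} GB")     # 60 GB
```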
sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo format, then combines their attributes into a new set and builds the new matrix from those.
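For reference, an hstack-based route could look like the sketch below: append K empty columns, then reorder the columns with a permutation. The sizes, the seed, and the names perm and unused are assumptions made for this illustration; the reordering still goes through sparse column indexing, so it is not necessarily the fastest option.

```python
import numpy as np
from scipy import sparse

M, N, K = 50, 10, 2
size = N + K
rng = np.random.default_rng(0)
x = sparse.random(M, N, density=0.1, format="csc", dtype="float64", random_state=0)
indeces = rng.choice(size, size=N, replace=False)  # target position of each column

# Append K all-zero columns on the right
stacked = sparse.hstack([x, sparse.csc_matrix((M, K))], format="csc")

# perm[j] = which column of `stacked` ends up at position j of the result
perm = np.empty(size, dtype=np.intp)
perm[indeces] = np.arange(N)                     # real columns go to their targets
unused = np.setdiff1d(np.arange(size), indeces)  # positions left for null columns
perm[unused] = np.arange(N, size)

result = stacked[:, perm]
```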
After quite a bit of playing around, I found that the following simple operation works:
In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
In [480]: z1
Out[480]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
Compare this with x and null_mat:
In [481]: x
Out[481]:
<5000x10 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in LInked List format>
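The coo remapping shown above can be wrapped in a small helper. This is a sketch, not scipy API: the function name remap_columns and its validation checks are hypothetical additions for illustration.

```python
import numpy as np
from scipy import sparse


def remap_columns(x, positions, n_cols):
    """Return a coo matrix with n_cols columns; old column i lands at positions[i].

    Hypothetical helper for illustration; the name and checks are not from scipy.
    """
    positions = np.asarray(positions)
    if positions.shape != (x.shape[1],):
        raise ValueError("need exactly one target position per input column")
    if len(np.unique(positions)) != len(positions):
        raise ValueError("target positions must be distinct")
    if positions.min() < 0 or positions.max() >= n_cols:
        raise ValueError("target positions must lie in [0, n_cols)")
    x = x.tocoo()  # guarantees .row / .col / .data are available
    return sparse.coo_matrix(
        (x.data, (x.row, positions[x.col])), shape=(x.shape[0], n_cols)
    )


x = sparse.random(6, 3, density=0.5, dtype="float64", random_state=0)
z = remap_columns(x, [4, 0, 2], n_cols=5)  # columns 1 and 3 stay empty
```

Only the column index array is rewritten; the data and row arrays are reused, so no dense intermediate is created.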
Testing the equality of sparse matrices can be tricky. coo values in particular can occur in any order, as they do in x, which was produced by sparse.random.
But the csr format orders the rows, so this comparison of the indptr attribute is a reasonable first equality test (it only confirms that every row stores the same number of elements; the dense comparison further down is the stricter check):
In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
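A stricter check that stays sparse is to count the positions where two matrices differ. The sketch below uses small made-up sizes and a deliberately wrong mapping (the names good, same and wrong are illustrative) to show that indptr alone cannot detect misplaced columns, while the elementwise comparison can.

```python
import numpy as np
from scipy import sparse

M, N, size = 50, 4, 6
x = sparse.random(M, N, density=0.5, dtype="float64", random_state=0)
positions = np.array([5, 0, 3, 1])


def build(pos):
    # Same coo remapping as in the answer, converted to csr for comparison
    return sparse.coo_matrix((x.data, (x.row, pos[x.col])), shape=(M, size)).tocsr()


good = build(positions)
same = build(positions)
wrong = build(positions[::-1])  # deliberately wrong: reversed target positions

# indptr only encodes how many entries each row stores
indptr_equal = np.array_equal(good.indptr, wrong.indptr)

# `!=` returns a sparse boolean matrix that stores only the mismatching positions
n_mismatch_same = (good != same).nnz
n_mismatch_wrong = (good != wrong).nnz
```

Here indptr_equal is True even for the wrong mapping, whereas the mismatch count is zero only for the correctly built matrix.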
Timings:
In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Matrix multiplication approach
csr format indexing with lists is done with matrix multiplication. It constructs an extractor matrix and applies it. Matrix multiplication is a strong point of csr_matrix.
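The equivalence between column selection and multiplication by an extractor matrix can be checked directly. This is a sketch with made-up sizes and a hypothetical column list; it demonstrates only that the two results agree, and whether scipy's indexing code takes this route internally may depend on the version.

```python
import numpy as np
from scipy import sparse

x = sparse.random(6, 5, density=0.5, format="csr", dtype="float64", random_state=0)
cols = [3, 0, 3]  # columns to extract; repeats are allowed

# Extractor with a single 1 per output column: E[cols[j], j] = 1
E = sparse.csr_matrix(
    (np.ones(len(cols)), (cols, np.arange(len(cols)))),
    shape=(x.shape[1], len(cols)),
)

via_product = x @ E       # (6 x 5) @ (5 x 3) -> 6 x 3
via_indexing = x[:, cols]
```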
We can perform the reordering in the same way:
In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [496]: w1=x*I
Comparing the dense equivalents of these matrices:
In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True
In [499]: %%timeit
...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
...: 12))
...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's better than the lil indexing approach, though still much slower than the direct coo matrix construction. To be fair, though, we should construct a csr matrix from the coo-style inputs, and that conversion takes some time:
In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Error traceback
The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:
Signature: null_mat.__setitem__(index, x)
Source:
def __setitem__(self, index, x):
    ....
    if isspmatrix(x):
        x = x.toarray()
    ...
Signature: x.toarray(order=None, out=None)
Source:
def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:
def _process_toarray_args(self, order, out):
    ...
    return np.zeros(self.shape, dtype=self.dtype, order=order)
I found this by searching the scipy GitHub repository for np.zeros calls.