Inserting null columns into a scipy sparse matrix in a specific order


Question

I have a sparse matrix with M rows and N columns, to which I want to add K extra null (all-zero) columns, so that the result has M rows and N+K columns. The tricky part is that I also have a list of N indices, each in the range 0 to N+K-1, that gives the position each original column should have in the new matrix.

So for example, if N = 2, K = 1 and the list of indices is [2, 0], I want the last column of my MxN matrix to become the first one, then a null column, and then my first column as the last one.

I'm trying the following code. In my real problem I already have x, but I can't upload it here, so this example generates a random one.

import numpy as np
from scipy import sparse
M = 5000
N = 10
pad_factor = 1.2
size = int(pad_factor * N)
x = sparse.random(m = M, n = N, density = 0.1, dtype = 'float64')
indeces = np.random.choice(range(size), size=N, replace=False)
null_mat = sparse.lil_matrix((M, size))
null_mat[:, indeces] = x

The problem is that for N = 1,500,000, P = 5,000 and K = 200 this code won't scale, and it gives me a memory error. The exact error is: "return np.zeros(self.shape, dtype = self.dtype, order=order) MemoryError".

I just want to add some null columns, so I guess my slicing idea is inefficient, especially as K << N in my real data. In a way this resembles the merge step of a merge sort: I have a non-null dataset and a null dataset, and I want to concatenate them in a specific order. Any ideas on how to make it work?

Thanks!

Recommended answer

As I deduced in the comments, the memory error was produced in the line

null_mat[:, indeces] = x

because the lil __setitem__ method calls x.toarray(); that is, it first converts x to a dense array. Mapping the sparse matrix onto the indexed lil directly might be more space efficient, but it would take a lot more work to code. And lil is optimized for iterative assignment, not for this kind of large-scale matrix mapping.

sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo, combines their attributes into a new set, and builds the new matrix from those.

After quite a bit of playing around, I found that the following simple operation works:

In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))

In [480]: z1
Out[480]: 
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in COOrdinate format>
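
The one-liner above can be wrapped in a small helper. This is only a sketch: the name scatter_columns and its validity checks are illustrative assumptions, not part of scipy or of the original answer. It reuses the data and row arrays of x and rewrites only the column coordinates, so no dense array is created.

```python
import numpy as np
from scipy import sparse


def scatter_columns(x, new_positions, n_cols):
    """Place column j of x at column new_positions[j] of an (M, n_cols) matrix.

    Hypothetical helper; target columns not listed in new_positions stay empty.
    """
    x = sparse.coo_matrix(x)  # make sure .row, .col and .data are available
    new_positions = np.asarray(new_positions)
    if new_positions.shape != (x.shape[1],):
        raise ValueError("need exactly one target position per column of x")
    if len(np.unique(new_positions)) != len(new_positions):
        # Duplicate targets would be summed together when coo is converted.
        raise ValueError("target positions must be distinct")
    # Only the column coordinates change; data and row indices are reused.
    return sparse.coo_matrix(
        (x.data, (x.row, new_positions[x.col])),
        shape=(x.shape[0], n_cols),
    )


# The small example from the question: N = 2, K = 1, positions [2, 0].
x_small = sparse.coo_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))
z_small = scatter_columns(x_small, [2, 0], 3)
print(z_small.toarray())
```

Here the original second column ends up first, the middle column is empty, and the original first column ends up last.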

Compare this with x and null_mat:

In [481]: x
Out[481]: 
<5000x10 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]: 
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in LInked List format>

Testing the equality of sparse matrices can be tricky. coo values in particular can occur in any order, as in x, which was produced by sparse.random.

But the csr format orders the rows, so comparing the indptr attributes is a pretty good equality test (it checks the number of stored elements in each row, not their column positions):

In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
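
When the matrices are too large to densify, a stricter check than indptr is still possible without leaving sparse storage. The sketch below is an assumption-labelled alternative rather than the answer's own method: sparse_equal is a hypothetical helper that relies on the elementwise != of scipy sparse matrices, whose result stores only the positions where the operands differ.

```python
import numpy as np
from scipy import sparse


def sparse_equal(a, b):
    """Return True if two sparse matrices have the same shape and values."""
    if a.shape != b.shape:
        return False
    # Convert both to csr so the comparison runs on a common format;
    # the != result keeps an entry only where the two matrices differ.
    return (a.tocsr() != b.tocsr()).nnz == 0


dense = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
a = sparse.coo_matrix(dense)
b = sparse.lil_matrix(dense)
# Reversed columns: same number of entries per row, different positions.
c = sparse.coo_matrix(dense[:, ::-1])

print(sparse_equal(a, b))
print(sparse_equal(a, c))
print(np.array_equal(a.tocsr().indptr, c.tocsr().indptr))
```

The last line shows why indptr alone is only a partial test: a and c have identical indptr arrays even though their columns differ.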

Timing tests:

In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [478]: 
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Matrix multiplication approach

csr format indexing with lists is done with matrix multiplication. It constructs an extractor matrix and applies it. Matrix multiplication is a csr_matrix strong point.

We can perform the reordering in the same way:

In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]: 
<10x12 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [496]: w1=x*I
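
The same idea, written as a self-contained sketch on the question's tiny example. The helper name placement_matrix is an illustrative assumption; the construction mirrors the I matrix above, with a single 1 in row j at column new_positions[j], so that right-multiplying by it moves column j of x to that position.

```python
import numpy as np
from scipy import sparse


def placement_matrix(new_positions, n_cols):
    """Build an (N, n_cols) 0/1 matrix with a 1 at (j, new_positions[j])."""
    new_positions = np.asarray(new_positions)
    n = len(new_positions)
    return sparse.csr_matrix(
        (np.ones(n), (np.arange(n), new_positions)),
        shape=(n, n_cols),
    )


# N = 2, K = 1, positions [2, 0], as in the question.
x_small = sparse.csr_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))
P = placement_matrix([2, 0], 3)
w_small = x_small @ P  # sparse-sparse product; the result stays sparse
print(w_small.toarray())
```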

Comparing the dense equivalents of these matrices:

In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True


In [499]: %%timeit
     ...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
     ...: 12))
     ...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's better than the lil indexing approach, though still much slower than the direct coo matrix construction. To be fair, though, the product is a csr matrix, so the comparable baseline is constructing a csr matrix from the coo-style inputs, and that conversion takes some time:

In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
     ...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Error traceback

The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:

Signature: null_mat.__setitem__(index, x)
Source:   
    def __setitem__(self, index, x):
       ....
       if isspmatrix(x):
           x = x.toarray()
       ...

Signature: x.toarray(order=None, out=None)
Source:   
    def toarray(self, order=None, out=None):
        """See the docstring for `spmatrix.toarray`."""
        B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:   
    def _process_toarray_args(self, order, out):
        ...
        return np.zeros(self.shape, dtype=self.dtype, order=order)

I found this by doing a code search on the scipy github for the np.zeros calls.
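
A rough size estimate shows why that np.zeros call fails. The sketch below assumes the real matrix has 5,000 rows and 1,500,000 float64 columns, which is one reading of the sizes quoted in the question; dense_nbytes is a hypothetical helper, not a scipy or numpy function.

```python
import numpy as np


def dense_nbytes(shape, dtype=np.float64):
    """Bytes that np.zeros(shape, dtype=dtype) would have to allocate."""
    n_elements = 1
    for dim in shape:
        n_elements *= int(dim)
    return n_elements * np.dtype(dtype).itemsize


# Assumed sizes: 5,000 rows by 1,500,000 columns of float64.
needed = dense_nbytes((5_000, 1_500_000))
print(f"{needed / 1e9:.0f} GB")
```

Under these assumptions the dense copy of x alone would need about 60 GB, regardless of how few of its entries are nonzero.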
