How to incrementally create a sparse matrix in Python?

Question

I am creating a co-occurrence matrix of integers, 1M by 1M in size. After the matrix is created, the only operation I will do on it is to get the top N values for each row (or column, as it is a symmetric matrix).

I have to create the matrix as sparse so that it fits in memory. I read input data from a big file and update the co-occurrence count for two indexes (row, col) incrementally.

The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that. Do I have to specify the size beforehand, or can I just create it incrementally?

import numpy as np
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update element

Recommended answer

An SO question from a couple of days ago, "creating sparse matrix of unknown size", talks about creating a sparse matrix from data read from a file. There the OP wanted to use the lil format; I recommended building the input arrays for the coo format.
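As a minimal sketch of that coo route (the `pairs` list below is made-up stand-in data for whatever the file reader yields, and the 5 x 5 shape is just an illustrative upper bound), the indexes can be collected in plain Python lists and handed to coo_matrix in one go. Converting to csr sums duplicate (row, col) entries, which is what a co-occurrence count needs:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical input: (row, col) pairs as they might be read from the big file.
pairs = [(0, 1), (1, 0), (0, 1), (2, 3), (3, 2), (0, 1)]

rows, cols, vals = [], [], []
for i, j in pairs:
    # Appending to Python lists is cheap; no sparse indexing is involved.
    rows.append(i)
    cols.append(j)
    vals.append(1)

# Assumed upper bound on the matrix size (1M x 1M in the question).
shape = (5, 5)
M = coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.int64)

# coo keeps the duplicate entries; the conversion to csr sums them.
C = M.tocsr()
print(C[0, 1], C.nnz)
```

For a file too large to hold all the pairs at once, the same idea could be applied per chunk, adding the resulting csr matrices together.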

In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix, even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases I suggested building a dict with tuple keys and using update to add the values to an already defined dok. But I suspect that in your case the coo route is better.
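The tuple-key dictionary idea might look roughly like this. It is only a sketch: the `pairs` data is invented, and instead of calling update on a dok I convert the dictionary to coo-style arrays, since I'm not certain that dok's update behaves the same way across SciPy versions. Note that the shape can be inferred from the largest index seen, so nothing has to be declared up front:

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical input pairs; in practice these come from the file.
pairs = [(0, 1), (0, 1), (2, 3), (4, 4)]

# A plain dict keyed by (row, col) tuples accumulates the counts.
counts = defaultdict(int)
for i, j in pairs:
    counts[(i, j)] += 1

# Turn the dict into the three arrays that coo_matrix expects.
keys = np.array(list(counts.keys()), dtype=np.int64)
vals = np.fromiter(counts.values(), dtype=np.int64, count=len(counts))

# Infer a square shape from the largest index actually seen.
n = int(keys.max()) + 1
S = coo_matrix((vals, (keys[:, 0], keys[:, 1])), shape=(n, n)).tocsr()
print(S.shape, S[0, 1])
```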

dok and lil are the best of the sparse formats for incremental construction, but neither is that great compared to Python list and dict methods.

As for the top N values of each row, I recall exploring that, but it was some time ago, so I can't offhand pull up a good SO question. You probably want a row-oriented format such as lil or csr.
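One possible way to do this with csr is to walk the indptr array and pick the largest stored values in each row's slice of data. The function name `top_n_per_row` is my own, not a SciPy API, and it assumes n is at least 1:

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_per_row(A, n):
    """For each row, list (col, value) for the n largest stored values (n >= 1)."""
    A = csr_matrix(A)
    result = []
    for r in range(A.shape[0]):
        # indptr marks where row r starts and ends in indices/data.
        start, end = A.indptr[r], A.indptr[r + 1]
        cols = A.indices[start:end]
        vals = A.data[start:end]
        if len(vals) > n:
            # argpartition finds the n largest without a full sort.
            keep = np.argpartition(vals, -n)[-n:]
        else:
            keep = np.arange(len(vals))
        # Order the kept entries from largest to smallest value.
        order = keep[np.argsort(-vals[keep], kind="stable")]
        result.append([(int(cols[k]), vals[k].item()) for k in order])
    return result

A = csr_matrix(np.array([[0, 5, 2, 9],
                         [1, 0, 0, 0],
                         [0, 0, 0, 0],
                         [4, 3, 7, 1]]))
top = top_n_per_row(A, 2)
print(top)
```

Only stored (nonzero) values are considered, so a row with fewer than n stored entries returns a shorter list. A Python loop over 1M rows will be slow but workable; I haven't timed it.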

As for the question 'do you need to specify the size on creation': yes. Because a sparse matrix, regardless of format, only stores nonzero values, there's little harm in creating a matrix that is too large.

I can't think of anything in the dok or coo format that hinges on the shape, at least not in terms of data storage or creation. lil and csr will have some extra values that scale with the number of rows. If you really need to explore this, read up on how values are stored, and play with small matrices.
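A small experiment along those lines, using made-up data: the same three entries are stored under two different shapes. In coo, the stored arrays have the same length either way. In csr, the indptr array has one entry per row plus one, so it does grow with the declared number of rows:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three arbitrary nonzero entries (illustrative values only).
rows = np.array([0, 3, 7])
cols = np.array([1, 4, 2])
vals = np.array([10, 20, 30])

small = coo_matrix((vals, (rows, cols)), shape=(10, 10))
big = coo_matrix((vals, (rows, cols)), shape=(1_000_000, 1_000_000))

# coo stores row, col and data arrays of length nnz; shape is just metadata.
print(small.data.size, big.data.size)

# csr stores indptr with shape[0] + 1 entries, regardless of nnz.
small_csr, big_csr = small.tocsr(), big.tocsr()
print(small_csr.indptr.size, big_csr.indptr.size)
```

So an oversized csr matrix costs a few extra megabytes for indptr, which is small next to the data for a matrix with many nonzeros. lil keeps one Python list per row, so it would pay a similar per-row cost.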

=================

It looks like all the code for the dok format is Python, in

/usr/lib/python3/dist-packages/scipy/sparse/dok.py

Scanning that file, I see that dok does have a resize method:

d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.

Any non-zero elements that lie outside the new shape are removed.
File:      /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type:      method

So if you want to initialize the matrix to 1M x 1M and later resize it to 100 x 100, you can do so. It will step through all the keys to make sure there aren't any outside the new range, so it isn't cheap, even though the main action is to change the shape parameter.

    newM, newN = shape
    M, N = self.shape
    if newM < M or newN < N:
        # Remove all elements outside new dimensions
        for (i, j) in list(self.keys()):
            if i >= newM or j >= newN:
                del self[i, j]
    self._shape = shape

If you know for sure that there aren't any keys outside the new range, you could change the shape directly. In this version of SciPy, the other sparse formats don't have a resize method.

In [31]: d=sparse.dok_matrix((10,10),int)

In [32]: d
Out[32]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>

In [33]: d.resize((5,5))

In [34]: d
Out[34]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>

In [35]: d._shape=(9,9)

In [36]: d
Out[36]: 
<9x9 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>

See also:

Why are lil_matrix and dok_matrix so slow compared to a plain dict?

Get the top n items of every row in a sparse matrix
