How to incrementally create a sparse matrix in Python?
Question
I am creating a co-occurrence matrix of size 1M by 1M, holding integers. After the matrix is created, the only operation I will do on it is to get the top N values for each row (or column, as it is a symmetric matrix).
I have to create the matrix as sparse for it to fit in memory. I read input data from a big file, and update the co-occurrence of two indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that. Do I have to specify the size beforehand, or can I just create it incrementally?
    import numpy as np
    from scipy.sparse import dok_matrix

    S = dok_matrix((5, 5), dtype=np.float32)
    for i in range(5):
        for j in range(5):
            S[i, j] = i + j  # Update element
Recommended answer
A SO question from a couple of days ago, "creating sparse matrix of unknown size", talks about creating a sparse matrix from data read from a file. There the OP wanted to use the lil format; I recommended building the input arrays for a coo format.
In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix, even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases I suggested building a dict with tuple keys, and using update to add the values to a defined dok. But I suspect in your case the coo route is better.
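As a rough sketch of the coo route: accumulate the counts in a plain dict keyed by (row, col), then build the matrix once at the end. The `pairs` list is a hypothetical stand-in for whatever is parsed from the input file, and mirroring each entry is an assumption made to keep the result symmetric.

```python
from collections import Counter

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical stand-in for the (row, col) pairs parsed from the big file.
pairs = [(0, 1), (1, 2), (0, 1), (3, 0), (1, 2), (0, 1)]

# Accumulate counts in a plain dict keyed by (row, col).
counts = Counter()
for i, j in pairs:
    counts[i, j] += 1
    if i != j:
        counts[j, i] += 1  # mirror the entry so the matrix stays symmetric

# Split the tuple keys into row and column index arrays.
rows, cols = (np.array(k) for k in zip(*counts))
data = np.fromiter(counts.values(), dtype=np.int64, count=len(counts))

# The shape is only decided here, from the largest index actually seen.
n = int(max(rows.max(), cols.max())) + 1
M = coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
```

With this approach the final size never has to be known while reading the file; only the dict grows.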
dok and lil are the best sparse formats for incremental construction, but neither is that great compared to Python list and dict methods.
As for the top N values of each row, I recall exploring that, but it was some time ago, so I can't offhand pull up a good SO question. You probably want a row-oriented format such as lil or csr.
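One possible way to get the top N values per row from a csr matrix is to slice `data` and `indices` with `indptr`, row by row. This is only a sketch: the helper name `top_n_per_row` is made up for illustration, it looks only at stored (nonzero) entries, and it assumes `n >= 1`.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_per_row(mat, n):
    """For each row, return (column, value) pairs of its n largest stored values."""
    mat = csr_matrix(mat)
    result = []
    for r in range(mat.shape[0]):
        # Stored entries of row r live in data[start:end] / indices[start:end].
        start, end = mat.indptr[r], mat.indptr[r + 1]
        vals = mat.data[start:end]
        cols = mat.indices[start:end]
        if len(vals) > n:
            # argpartition finds the n largest without sorting the whole row.
            keep = np.argpartition(vals, -n)[-n:]
            vals, cols = vals[keep], cols[keep]
        order = np.argsort(-vals, kind="stable")  # largest first
        result.append([(int(c), v.item()) for c, v in zip(cols[order], vals[order])])
    return result

M = csr_matrix(np.array([[0, 5, 1, 3],
                         [5, 0, 0, 0],
                         [1, 0, 0, 2],
                         [3, 0, 2, 0]]))
top2 = top_n_per_row(M, 2)
```

For a 1M-row matrix the Python loop may be slow, but each row is handled independently, so the work can be done in chunks.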
As for the question 'do you need to specify the size on creation': yes. But because a sparse matrix, regardless of format, only stores nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format that hinges on the shape, at least not in terms of data storage or creation. lil and csr will have some extra values that scale with the number of rows (the per-row lists in lil, the row-pointer array in csr). If you really need to explore this, read up on how values are stored, and play with small matrices.
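A small experiment along those lines, with arbitrary illustrative values: build the same three entries with two different shapes and compare what each format actually stores.

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 2, 3])
cols = np.array([1, 3, 0])
data = np.array([4, 7, 1])

small = coo_matrix((data, (rows, cols)), shape=(4, 4))
large = coo_matrix((data, (rows, cols)), shape=(1000, 1000))

# coo stores just the three arrays, so the shape does not change their size.
coo_sizes = (small.data.size, large.data.size)

# csr keeps one row pointer per row (plus one), so indptr grows with the row count.
indptr_sizes = (small.tocsr().indptr.size, large.tocsr().indptr.size)

# lil keeps one Python list per row, empty or not.
lil_rows = (len(small.tolil().rows), len(large.tolil().rows))
```

So an oversized shape costs nothing in coo, and in csr or lil only something proportional to the number of rows.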
=================
It looks like all the code for the dok format is Python, in
/usr/lib/python3/dist-packages/scipy/sparse/dok.py
Scanning that file, I see that dok does have a resize method:
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix to 1M x 1M and resize to 100 x 100 you can do so. It will step through all the keys to make sure there aren't any outside the new range, so it isn't cheap, even though the main action is to change the shape parameter.
    newM, newN = shape
    M, N = self.shape
    if newM < M or newN < N:
        # Remove all elements outside new dimensions
        for (i, j) in list(self.keys()):
            if i >= newM or j >= newN:
                del self[i, j]
    self._shape = shape
If you know for sure that there aren't any outside keys, you could change the shape directly. The other sparse formats (in this SciPy version) don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
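To see the element-dropping behaviour described in the docstring, here is a minimal sketch with two made-up entries, one inside and one outside the new shape. It passes `dtype` by keyword; in the session above `int` is passed positionally, which may be why the repr reports float64.

```python
from scipy.sparse import dok_matrix

d = dok_matrix((10, 10), dtype=int)
d[1, 2] = 7   # stays inside the new 5x5 shape
d[8, 8] = 3   # falls outside and should be removed by resize

d.resize((5, 5))
remaining = d.nnz
```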
See also:
Why are lil_matrix and dok_matrix so slow compared to a plain dict?