Why are lil_matrix and dok_matrix so slow compared to a common dict of dicts?
Question
I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:
class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix
This is an efficient structure for constructing sparse matrices incrementally.
class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.
This is an efficient structure for constructing sparse matrices incrementally.
But when I run benchmarks comparing this to building a dict of dicts of values (which can later easily be converted to a sparse matrix), the latter turns out to be about 10-20 times faster than using either of the sparse matrix models:
from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict
def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1
    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs
def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols),
                 "from __main__ import common_dict",
                 number=100)
    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols),
                 "from __main__ import dok",
                 number=100)
    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols),
                 "from __main__ import lil",
                 number=100)
    print("lil: {}".format(res))
Result:
benchmark()
common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666
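The question's claim that a dict of dicts "can later be easily converted to a sparse matrix" can be sketched roughly like this (the helper `dict_to_sparse` is my own illustration, not part of the benchmark):

```python
from collections import defaultdict
from scipy.sparse import coo_matrix

def dict_to_sparse(freqs, shape):
    # Flatten the dict of dicts into parallel row/col/data lists,
    # then build a COO matrix in one shot.
    rows, cols, data = [], [], []
    for r, row_dict in freqs.items():
        for c, v in row_dict.items():
            rows.append(r)
            cols.append(c)
            data.append(v)
    return coo_matrix((data, (rows, cols)), shape=shape)

freqs = defaultdict(lambda: defaultdict(int))
for r, c in zip(range(3), range(3)):
    freqs[r][c] += 1
m = dict_to_sparse(freqs, (3, 3))  # 3x3 matrix with ones on the diagonal
```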
What is it that causes such overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil is preferable to a common dict of dicts?
Answer
When I change the += to just = for your 2 sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1
their respective times are cut in half. What's consuming the most time is the indexing: with += it has to do both a __getitem__ and a __setitem__.
When the docs say that dok and lil are better for iterative construction, they mean that it's easier to expand their underlying data structures than it is for the other formats.
When I try to make a csr matrix with your code, I get:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)
and a 30x slower run. So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.
You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freq[r,c].
A faster way of constructing the dok is:

freqs = dok_matrix((1000, 1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)
taking advantage of the fact that a dok is a subclassed dictionary. Note that a dok matrix is not a dictionary of dictionaries; its keys are tuples like (50,50).
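A tiny sketch of that point (my own illustration, not from the original answer): the dok stores each nonzero under a single (row, col) tuple key rather than nesting row dicts:

```python
from scipy.sparse import dok_matrix

m = dok_matrix((4, 4))
m[2, 3] = 7.0
# The entry lives under the single tuple key (2, 3),
# not nested as m[2][3].
print(m[2, 3])  # 7.0
print(m.nnz)    # 1
```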
Another fast way of constructing the same sparse array is:
freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))
In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.
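One related detail worth noting (my addition, easily checked against the scipy.sparse.coo_matrix documentation): duplicate (row, col) entries are summed when a coo matrix is converted, so a data array of ones reproduces the += counting semantics even if a coordinate repeats:

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = [0, 0, 1]               # (0, 2) appears twice
cols = [2, 2, 1]
data = np.ones(3, dtype=int)   # one count per occurrence

m = coo_matrix((data, (rows, cols)), shape=(2, 3)).tocsr()
print(m[0, 2])  # duplicates summed -> 2
print(m[1, 1])  # -> 1
```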
But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.
Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.
Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in biological database files. But still the (data), (row, col) definition method works best.
So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster than the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.
def fast_dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs
Skipping the get and just doing

freqs.update({(row, col): 1})

is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as a simple dictionary initialization ({(r, c): 1 for r, c in zip(rows, cols)}).
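For completeness, that dictionary-comprehension idea can be sketched as a standalone builder (the name build_ones is mine; this variant assumes each (row, col) pair occurs only once, since a dict keeps a single value per key, and it converts through coo_matrix rather than relying on dok.update, which newer SciPy releases may restrict):

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_ones(rows, cols, shape=(1000, 1000)):
    # Assumes unique (row, col) pairs: a dict holds one value per key.
    d = {(r, c): 1 for r, c in zip(rows, cols)}
    keys = np.array(list(d.keys()))           # shape (n, 2)
    vals = np.fromiter(d.values(), dtype=int)
    return coo_matrix((vals, (keys[:, 0], keys[:, 1])), shape=shape)

m = build_ones(range(5), range(5), shape=(5, 5))
```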