Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?


Question

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:

lil_matrix:

class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix

This is an efficient structure for constructing sparse matrices incrementally.

dok_matrix:

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.

This is an efficient structure for constructing sparse matrices incrementally.

But when I run benchmarks comparing these to building a dict of dicts of values (which can later easily be converted to a sparse matrix), the latter turns out to be about 10-20 times faster than either of the sparse matrix models:

from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1

    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs


def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols), 
                 "from __main__ import common_dict", 
                 number=100)

    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols), 
                 "from __main__ import dok", 
                 number=100)

    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols), 
                 "from __main__ import lil", 
                 number=100)

    print("lil: {}".format(res))

Results:

benchmark()

common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666

What is it that causes such an overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil is to be preferred over a common dict of dicts?

Answer

When I change your += to just = for your 2 sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What's consuming the most time is the indexing. With += it has to do both a __getitem__ and a __setitem__.
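
To see where the doubled cost comes from, this is roughly what the interpreter does for the augmented assignment (a sketch of the expansion; the real dispatch also goes through additional index-normalization helpers):

# freqs[row, col] += 1 expands into a read followed by a write:
tmp = freqs.__getitem__((row, col))       # full sparse __getitem__: key and bounds checks
freqs.__setitem__((row, col), tmp + 1)    # full sparse __setitem__: the same checks again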

When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.
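
You can look at those growable structures directly; lil_matrix, for instance, keeps one Python list of column indices and one list of values per row, which is cheap to append to. A quick inspection sketch (the printed representation varies a bit across versions):

from scipy.sparse import lil_matrix

l = lil_matrix((3, 3))
l[1, 2] = 5
print(l.rows)    # per-row lists of column indices, e.g. [list([]) list([2]) list([])]
print(l.data)    # matching per-row lists of values,  e.g. [list([]) list([5.0]) list([])]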

When I try to make a csr matrix with your code, I get a:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

and a 30x slower speed.
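
For reference, this is the kind of csr variant that produces the warning, written in the same shape as the functions above (a sketch, not part of the original benchmark):

from scipy.sparse import csr_matrix

def csr(rows, cols):
    freqs = csr_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        # each new nonzero forces csr to rebuild its packed index arrays
        freqs[row, col] += 1
    return freqs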

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.

You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freq[r,c].
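
A convenient way to read that code from a live session, assuming inspect can locate the source (it can for pure-Python methods like these):

import inspect
from scipy.sparse import dok_matrix

# print the source of the indexing methods to see the per-access overhead
print(inspect.getsource(dok_matrix.__getitem__))
print(inspect.getsource(dok_matrix.__setitem__))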

A quicker way to construct the dok is:

freqs = dok_matrix((1000,1000))
d = dict()                          # accumulate in a plain dict first
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)                     # bulk-load via the inherited dict.update

taking advantage of the fact that a dok is a subclass of dict. Note that a dok matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50).
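
A small check of that flat-tuple layout (it relies on dok_matrix being a dict subclass, which holds for the SciPy versions discussed here):

from scipy.sparse import dok_matrix

d = dok_matrix((100, 100))
d[50, 50] = 1.0
print(dict(d))    # {(50, 50): 1.0} -- one flat dict keyed by (row, col) tuples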

Another fast way of constructing the same sparse array is:

# assumes: import numpy as np; from scipy import sparse
freqs = sparse.coo_matrix((np.ones(1000, int), (rows, cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.
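
One detail that makes the coo route work for frequency counting in general, and not just for these duplicate-free rows and cols: repeated (row, col) pairs are summed when the matrix is converted, so a data array of ones really does become counts. A small demonstration of that behavior:

import numpy as np
from scipy import sparse

rows = [0, 0, 1]    # (0, 0) appears twice
cols = [0, 0, 2]
freqs = sparse.coo_matrix((np.ones(len(rows), int), (rows, cols)),
                          shape=(1000, 1000)).tocsr()
print(freqs[0, 0])  # 2 -- the duplicate entries were summed on conversion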

But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.

Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.

Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in biological database files. But the (data, (row, col)) definition method still works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
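
That division of labor suggests the usual idiom when the entries are not known up front: accumulate in plain Python lists, then construct the sparse matrix once at the end. A sketch of that pattern (not code from the original answer):

from scipy import sparse

def build_incrementally(pairs, shape=(1000, 1000)):
    rows, cols, data = [], [], []
    for row, col in pairs:           # cheap Python-level appends
        rows.append(row)
        cols.append(col)
        data.append(1)
    # one construction at the end; duplicate (row, col) pairs sum into counts
    return sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()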

Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster than the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        # dict-style get/update bypass the sparse __getitem__/__setitem__ machinery
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs

Skipping the get, and just doing

         freqs.update({(row, col): 1})

is even faster - faster than the defaultdict-of-defaultdicts example, and nearly as fast as a plain dictionary comprehension ({(r, c): 1 for r, c in zip(rows, cols)}).
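
Putting those pieces together, the whole loop collapses into one comprehension plus one bulk update. A sketch (with a caveat: newer SciPy releases block dok_matrix.update to protect the matrix's internal consistency, so this trick is limited to versions where the inherited dict.update is still reachable):

def fast_dok2(rows, cols):
    freqs = dok_matrix((1000, 1000))
    # build the flat {(row, col): value} dict in one pass, then bulk-load it
    freqs.update({(r, c): 1 for r, c in zip(rows, cols)})
    return freqs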

