Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?


Problem description

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:

lil_matrix:

class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix

This is an efficient structure for constructing sparse matrices incrementally.

dok_matrix:

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.

This is an efficient structure for constructing sparse matrices incrementally.

But when I run benchmarks comparing this to building a dict of dicts of values (which can later easily be converted to a sparse matrix), the latter turns out to be about 10-20 times faster than using any of the sparse matrix models:

from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1

    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs


def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols), 
                 "from __main__ import common_dict", 
                 number=100)

    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols), 
                 "from __main__ import dok", 
                 number=100)

    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols), 
                 "from __main__ import lil", 
                 number=100)

    print("lil: {}".format(res))

Results:

benchmark()

common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666

What is it that causes such an overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil is preferable to a common dict of dicts?

Recommended answer

When I change your += to just = for your 2 sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What's consuming the most time is the indexing. With += it has to do both a __getitem__ and a __setitem__.
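
A quick way to verify this yourself is to time plain assignment against augmented assignment directly (a minimal sketch, not from the original answer; absolute numbers will vary by machine and SciPy version):

from timeit import timeit

setup = "from scipy.sparse import dok_matrix; m = dok_matrix((1000, 1000))"

# plain assignment: one __setitem__ per statement
print(timeit("m[5, 7] = 1", setup, number=10000))

# augmented assignment: a __getitem__ plus a __setitem__ per statement
print(timeit("m[5, 7] += 1", setup, number=10000))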

When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.

When I try to make a csr matrix with your code, I get:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

and the run is about 30x slower.

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.
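
If you want to reproduce that warning, inserting a new nonzero into an existing csr matrix is enough to trigger it (a sketch; the exact warning text and file path depend on the SciPy version):

import warnings
from scipy.sparse import csr_matrix

freqs = csr_matrix((1000, 1000))
with warnings.catch_warnings():
    warnings.simplefilter("always")
    # writing a new nonzero changes the sparsity structure of a csr_matrix
    freqs[0, 0] += 1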

You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freq[r,c].
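
You can pull those implementations up interactively with inspect (in some SciPy versions __getitem__ lives on a shared indexing mixin rather than on dok_matrix itself, but inspect resolves the inherited method for you):

import inspect
from scipy.sparse import dok_matrix

# print the pure-Python code that runs on every freq[r, c] read and write
print(inspect.getsource(dok_matrix.__getitem__))
print(inspect.getsource(dok_matrix.__setitem__))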

A faster way to construct your dok would be:

freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)

taking advantage of the fact that a dok is a subclassed dictionary. Note that a dok matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50).
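
A quick check of that key layout (this relies on the dict-like interface of dok_matrix; note that some newer SciPy releases disable dok_matrix.update() outright, so the trick above depends on the older behavior this answer was written against):

from scipy.sparse import dok_matrix

freqs = dok_matrix((10, 10))
freqs[5, 5] = 2.0
print(list(freqs.keys()))    # [(5, 5)] -- flat tuple keys, not nested dicts
print(list(freqs.values()))  # [2.0]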

Another fast way of constructing the same sparse array is:

import numpy as np
from scipy import sparse

freqs = sparse.coo_matrix((np.ones(1000, int), (rows, cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.
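
One detail that makes this work as a frequency count: coo_matrix keeps duplicate (row, col) entries and sums them on conversion, so repeated coordinates turn into counts. A small sketch:

import numpy as np
from scipy import sparse

rows = [0, 0, 1]
cols = [0, 0, 2]
# two entries land on (0, 0); converting to csr sums them into a 2
m = sparse.coo_matrix((np.ones(len(rows), int), (rows, cols)), shape=(3, 3))
print(m.tocsr().toarray())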

But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.

Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.

Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in biological database files. But still the (data, (row, col)) definition method works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
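
And converting a dict of dicts back to a sparse matrix afterwards is cheap: flatten it into the (data, (row, col)) triplets shown above. A sketch (the helper name dict_to_coo is mine, not from the answer):

import numpy as np
from scipy import sparse

def dict_to_coo(freqs, shape):
    # flatten the nested dict produced by common_dict() into triplets
    rows, cols, data = [], [], []
    for r, inner in freqs.items():
        for c, v in inner.items():
            rows.append(r)
            cols.append(c)
            data.append(v)
    return sparse.coo_matrix((np.array(data), (rows, cols)), shape=shape)

# usage: dict_to_coo(common_dict(range(1000), range(1000)), (1000, 1000))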

Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster than the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs

Skipping the get, and just doing

        freqs.update({(row, col): 1})

is even faster - faster than the defaultdict-of-defaultdict example, and nearly as fast as a simple dictionary initialization ({(r, c): 1 for r, c in zip(rows, cols)}).
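
Put together as a full function (a sketch; as noted above, dok_matrix.update() may be unavailable in newer SciPy releases, in which case only the plain-dict baseline still applies):

from scipy.sparse import dok_matrix

def fast_dok2(rows, cols):
    # valid when every (row, col) pair occurs at most once, so no get() is needed
    freqs = dok_matrix((1000, 1000))
    freqs.update({(row, col): 1 for row, col in zip(rows, cols)})
    return freqs

def plain_dict(rows, cols):
    # the plain-dictionary baseline the answer compares against
    return {(r, c): 1 for r, c in zip(rows, cols)}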
