How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Problem description

Scipy sparse matrices are typically manipulated with the built-in methods. But sometimes you need to read the matrix data in order to assign it to non-sparse data types. For the sake of demonstration, I created a random LIL sparse matrix and converted it to a Numpy array using different methods (pure Python data types would have made more sense!).

from __future__ import print_function
from scipy.sparse import rand, csr_matrix, lil_matrix
import numpy as np

dim = 1000
lil = rand(dim, dim, density=0.01, format='lil', dtype=np.float32, random_state=0)
print('number of nonzero elements:', lil.nnz)
arr = np.zeros(shape=(dim,dim), dtype=float)

number of nonzero elements: 10000

%%timeit -n3
for i in range(dim):  # range instead of xrange so this also runs on Python 3
    for j in range(dim):
        arr[i,j] = lil[i,j]

3 loops, best of 3: 6.42 s per loop

%%timeit -n3
nnz = lil.nonzero() # indices of nonzero values
for i, j in zip(nnz[0], nnz[1]):
    arr[i,j] = lil[i,j]

3 loops, best of 3: 75.8 ms per loop

The next one, a plain conversion with toarray(), is not a general solution for reading the matrix data, so it does not count as a solution.

%timeit -n3 arr = lil.toarray()

3 loops, best of 3: 7.85 ms per loop

Reading Scipy sparse matrices with these methods is not efficient at all. Is there a faster way to read these matrices?

Recommended answer

Try reading the raw data. Scipy sparse matrices keep their contents in Numpy ndarrays, and each format lays those arrays out differently.

%%timeit -n3
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i,j] = val

3 loops, best of 3: 4.61 ms per loop
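To make the LIL layout concrete, here is a minimal sketch on a tiny matrix. The matrix `small` and its values are made up for illustration; only the `rows` and `data` attributes come from the answer above.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Tiny illustrative matrix (shape and values are arbitrary assumptions).
small = lil_matrix((3, 4), dtype=np.float32)
small[0, 1] = 1.0
small[0, 3] = 2.0
small[2, 0] = 3.0

# rows[i] lists the column indices stored in row i,
# data[i] lists the matching values in the same order.
lil_rows = [list(r) for r in small.rows]
lil_vals = [[float(v) for v in d] for d in small.data]

# Same traversal as in the answer, applied to the tiny matrix.
dense = np.zeros(small.shape, dtype=float)
for i, (row, data) in enumerate(zip(small.rows, small.data)):
    for j, val in zip(row, data):
        dense[i, j] = val

print(lil_rows)
print(lil_vals)
```

An empty row simply shows up as an empty list in both `rows` and `data`, so the inner loop does no work for it.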

For a CSR matrix, reading from the raw data is a bit less pythonic, but it is worth it for the speed.

csr = lil.tocsr()

%%timeit -n3
start = 0
for i, end in enumerate(csr.indptr[1:]):
    for j, val in zip(csr.indices[start:end], csr.data[start:end]):
        arr[i,j] = val
    start = end

3 loops, best of 3: 8.14 ms per loop
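The CSR loop relies on how `indptr`, `indices` and `data` fit together. The sketch below shows that layout on a tiny made-up matrix, and also one possible way to do the same scatter without a Python-level loop by expanding `indptr` with `np.repeat`. The vectorized variant is an addition for illustration, not part of the original answer, and it was not benchmarked here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny illustrative matrix (values are arbitrary assumptions).
small_csr = csr_matrix(np.array([[0, 1, 0, 2],
                                 [0, 0, 0, 0],
                                 [3, 0, 0, 0]], dtype=np.float32))

# indptr[i]:indptr[i + 1] delimits the slice of indices/data belonging to row i.
indptr = small_csr.indptr.tolist()
indices = small_csr.indices.tolist()

# Expand indptr into one row index per stored value, then assign in one step.
row_counts = np.diff(small_csr.indptr)
row_ids = np.repeat(np.arange(small_csr.shape[0]), row_counts)
dense_csr = np.zeros(small_csr.shape, dtype=float)
dense_csr[row_ids, small_csr.indices] = small_csr.data

print(indptr, indices, row_ids.tolist())
```

Row 1 is empty, so two consecutive `indptr` entries are equal and `np.repeat` emits no row index for it.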

A similar approach is used in this DBSCAN implementation.

For a COO matrix, the raw data can be read directly:

coo = lil.tocoo()

%%timeit -n3
for i, j, d in zip(coo.row, coo.col, coo.data):
    arr[i,j] = d

3 loops, best of 3: 5.97 ms per loop
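Because COO stores three parallel arrays (`row`, `col`, `data`), the same read can be sketched on a tiny made-up matrix, together with a loop-free variant based on Numpy index arrays. The loop-free assignment is an illustrative addition rather than something taken from the answer.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Tiny illustrative matrix built from explicit triples (arbitrary values).
rows = np.array([0, 0, 2])
cols = np.array([1, 3, 0])
vals = np.array([1.0, 2.0, 3.0], dtype=np.float32)
small_coo = coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Entry k is the triple (row[k], col[k], data[k]).
triples = [(int(i), int(j), float(d))
           for i, j, d in zip(small_coo.row, small_coo.col, small_coo.data)]

# Loop-free scatter using the index arrays directly.
dense_coo = np.zeros(small_coo.shape, dtype=float)
dense_coo[small_coo.row, small_coo.col] = small_coo.data

print(triples)
```

One caveat: COO allows duplicate (row, col) entries, which `toarray()` sums, whereas plain assignment keeps only one of them. The matrix here has no duplicates, so both agree.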

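Based on these limited tests:

  • COO matrix: the cleanest
  • LIL matrix: the fastest
  • CSR matrix: the slowest and ugliest. The only upside is that converting to and from CSR is extremely fast.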
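To rerun this kind of comparison outside IPython, a rough harness based on the standard `timeit` module might look like the following. The function names (`fill_from_lil` and so on), the smaller dimension and the way the test matrix is generated are all assumptions made for this sketch; timings will vary with machine, Python and Scipy version, so no numbers are claimed.

```python
import timeit
import numpy as np
from scipy.sparse import coo_matrix

dim = 200   # smaller than the 1000 used above, to keep the sketch quick
nnz = 400   # number of stored values (density 0.01)
rng = np.random.default_rng(0)
flat = rng.choice(dim * dim, size=nnz, replace=False)  # unique positions
r, c = np.divmod(flat, dim)
vals = (rng.random(nnz) + 0.1).astype(np.float32)      # strictly nonzero
coo = coo_matrix((vals, (r, c)), shape=(dim, dim))
lil, csr = coo.tolil(), coo.tocsr()

def fill_from_lil(m):
    out = np.zeros(m.shape, dtype=float)
    for i, (row, data) in enumerate(zip(m.rows, m.data)):
        for j, val in zip(row, data):
            out[i, j] = val
    return out

def fill_from_csr(m):
    out = np.zeros(m.shape, dtype=float)
    start = 0
    for i, end in enumerate(m.indptr[1:]):
        for j, val in zip(m.indices[start:end], m.data[start:end]):
            out[i, j] = val
        start = end
    return out

def fill_from_coo(m):
    out = np.zeros(m.shape, dtype=float)
    for i, j, d in zip(m.row, m.col, m.data):
        out[i, j] = d
    return out

readers = {"lil": (fill_from_lil, lil),
           "csr": (fill_from_csr, csr),
           "coo": (fill_from_coo, coo)}

# Check correctness first, then time each reader (best of 3 runs of 3 calls).
expected = coo.toarray()
results_match = {name: bool(np.array_equal(fn(mat), expected))
                 for name, (fn, mat) in readers.items()}
timings = {name: min(timeit.repeat(lambda: fn(mat), number=3, repeat=3)) / 3
           for name, (fn, mat) in readers.items()}

for name in readers:
    print(name, results_match[name], "%.3f ms per call" % (timings[name] * 1e3))
```

Checking each reader against `toarray()` before timing guards against comparing a fast but wrong traversal with a slow correct one.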

Following @hpaulj, I added the COO matrix so that all the methods are in one place.
