Efficiently populate SciPy sparse matrix from subset of dictionary

Question

I need to store word co-occurrence counts in several 14000x10000 matrices. Since I know the matrices will be sparse and I do not have enough RAM to store all of them as dense matrices, I am storing them as scipy.sparse matrices.

I have found the most efficient way to gather the counts to be using Counter objects. Now I need to transfer the counts from the Counter objects to the sparse matrices, but this takes too long. It currently takes on the order of 18 hours to populate the matrices.

The code I'm using is roughly as follows:

for word_ind1 in range(len(wordlist1)):
    for word_ind2 in range(len(wordlist2)):
        word_counts[word_ind2, word_ind1]=word_counters[wordlist1[word_ind1]][wordlist2[word_ind2]]

Where word_counts is a scipy.sparse.lil_matrix object, word_counters is a dictionary of counters, and wordlist1 and wordlist2 are lists of strings.

Is there any way to do this more efficiently?

Recommended answer

You're using LIL matrices, which (unfortunately) have a linear-time insertion algorithm. Therefore, constructing them in this way takes quadratic time. Try a DOK matrix instead; those use hash tables for storage.
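A minimal sketch of the DOK approach, reusing the names from the question (word_counts, word_counters, wordlist1, wordlist2) with made-up toy data. The row_index lookup and the choice to loop only over pairs that actually have a count are assumptions added for illustration; they are not part of the original code.

```python
from collections import Counter
from scipy.sparse import dok_matrix

# Toy stand-ins for the question's data (hypothetical values).
wordlist1 = ["cat", "dog"]
wordlist2 = ["runs", "sleeps", "barks"]
word_counters = {
    "cat": Counter({"sleeps": 3, "runs": 1, "purrs": 5}),  # "purrs" is not in wordlist2
    "dog": Counter({"barks": 4, "runs": 2}),
}

# Map each row word to its row index once, so each lookup is a dict access.
row_index = {word: i for i, word in enumerate(wordlist2)}

# DOK stores entries in a hash table, so single-element assignment is cheap.
word_counts = dok_matrix((len(wordlist2), len(wordlist1)), dtype=int)
for col, word1 in enumerate(wordlist1):
    # Visit only the pairs that were actually counted, not every (row, col).
    for word2, count in word_counters[word1].items():
        row = row_index.get(word2)
        if row is not None:  # skip words outside the wanted subset
            word_counts[row, col] = count

# Convert to CSR once the matrix is fully populated, for fast arithmetic.
word_counts = word_counts.tocsr()
```

Iterating over the Counter entries instead of over every index pair also avoids touching the many cells whose count is zero.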

However, if you're interested in boolean term occurrences, then computing the co-occurrence matrix is much faster if you have a sparse term-document matrix. Let A be such a matrix of shape (n_documents, n_terms); then the co-occurrence matrix is

A.T * A
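As a small illustration of that product, the sketch below builds a toy boolean matrix (the values are hypothetical) and multiplies it with its transpose. It uses the @ operator, which performs matrix multiplication for both SciPy sparse matrices and the newer sparse arrays; with sparse arrays, * is element-wise.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 documents (rows) by 4 terms (columns).
# A 1 means the term occurs in the document.
A = csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]))

# Entry (i, j) counts the documents that contain both term i and term j;
# the diagonal holds each term's document frequency.
cooc = (A.T @ A).tocsr()
```

Here cooc has shape (n_terms, n_terms), and the product stays sparse throughout.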
