How to convert a co-occurrence matrix to a sparse matrix


Question

I am just starting to work with sparse matrices, so I am not very proficient with them yet. I have a simple co-occurrence matrix built from a word list: a 2-dimensional, word-by-word matrix that counts how many times a word occurs in the same context as another. The matrix is quite sparse, since the corpus is not that big. I want to convert it to a sparse matrix so that I can handle it better and eventually do some matrix multiplication. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):

from collections import defaultdict

def matrix(corpus):
    # d[head][word] counts how often the pair (head, word) occurs
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])

    return d, heads, trans

My idea would be to write a new function:

def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)

Does this make any sense? It does not work, though, and I don't know how to get a sparse matrix. Would I be better off working with NumPy arrays? What would be the best way to do this? I want to compare several ways of handling matrices.

It would be nice if someone could point me in the right direction.

Recommended answer

Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency:

import scipy.sparse

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))
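
The same (data, (row, col)) construction can also be applied to a nested dictionary of counts like the `d` built in the question. The following is only a minimal sketch under that assumption: the helper name `dict_to_coo`, the index dictionaries, and the toy word pairs are hypothetical and not part of the original answer.

```python
from collections import defaultdict
import scipy.sparse

def dict_to_coo(d):
    """Convert nested counts {row_word: {col_word: count}} to a COO matrix."""
    row_index = {}  # map row words to row indices
    col_index = {}  # map column words to column indices
    data, row, col = [], [], []
    for head, counts in d.items():
        i = row_index.setdefault(head, len(row_index))
        for word, count in counts.items():
            j = col_index.setdefault(word, len(col_index))
            data.append(count)
            row.append(i)
            col.append(j)
    shape = (len(row_index), len(col_index))
    M = scipy.sparse.coo_matrix((data, (row, col)), shape=shape)
    return M, row_index, col_index

# Hypothetical toy pairs, counted the same way as in the question's matrix()
d = defaultdict(lambda: defaultdict(int))
for head, word in [("cat", "sat"), ("cat", "sat"), ("cat", "ran"), ("dog", "ran")]:
    d[head][word] += 1

M, row_index, col_index = dict_to_coo(d)
dense = M.toarray()  # dense copy, only sensible for small matrices
```

Passing `shape` explicitly keeps the dimensions tied to the two vocabularies rather than leaving SciPy to infer them from the largest index seen.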

Now, to get the co-occurrence matrix:

A.T * A

(Ignore the diagonal, which holds co-occurrences of terms with themselves, i.e. squared frequencies.)
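
As a rough end-to-end illustration of the steps above, the sketch below builds A for a made-up corpus, forms the product, and zeroes the diagonal instead of merely ignoring it. The `documents` list is hypothetical, and clearing the diagonal through a LIL conversion is just one possible approach, not something the answer prescribes.

```python
import scipy.sparse

# Hypothetical toy corpus: each document is a list of terms
documents = [["a", "b"], ["b", "c"], ["a", "b", "c"]]

vocabulary = {}  # map terms to column indices
data, row, col = [], [], []
for i, doc in enumerate(documents):
    for term in doc:
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)),
                            shape=(len(documents), len(vocabulary)))

# Term-by-term co-occurrence counts; LIL allows cheap in-place edits
C = (A.T @ A).tolil()
C.setdiag(0)           # drop co-occurrences of terms with themselves
C = C.tocsr()
C.eliminate_zeros()    # remove any explicitly stored zeros
cooc = C.toarray()
```

Here `@` is used for the matrix product; for the `coo_matrix` class used in the answer, `A.T * A` computes the same thing.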

Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
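
For the scikit-learn route, a minimal sketch might look like the following. It assumes the documents are already tokenized into lists of terms (the `documents` list is made up), and it uses `CountVectorizer` with a callable analyzer so the tokens are passed through unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, pre-tokenized toy corpus
documents = [["a", "b"], ["b", "c"], ["a", "b", "c"]]

# The callable analyzer returns each document's token list as-is
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
A = vectorizer.fit_transform(documents)  # sparse document-term matrix

# Term-by-term co-occurrence matrix (diagonal included)
C = A.T @ A
vocabulary = vectorizer.vocabulary_      # maps terms to column indices
```

The vectorizer takes care of building the vocabulary and the sparse matrix, so only the final product has to be written by hand.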
