How to convert a co-occurrence matrix to a sparse matrix

Views: 164  Published: 2020/8/6 2:20:14  Tags: python, scipy, sparse-matrix


Question

I am just starting to work with sparse matrices, so I'm not really proficient in this topic. My problem is this: I have a simple co-occurrence matrix built from a word list, just a 2-dimensional word-by-word matrix counting how many times a word occurs in the same context as another. The matrix is quite sparse since the corpus is not that big. I want to convert it to a sparse matrix so I can handle it better and eventually do some matrix multiplication. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):

from collections import defaultdict

def matrix(corpus):
    # d[head][tran] counts how often the pair (head, tran) occurs
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])

    return d, heads, trans

My idea was to write a new function:

def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)

Does this make any sense? It does not work, however, and I don't know how to get a sparse matrix out of it. Would I be better off working with NumPy arrays? What would be the best way to do this? I want to compare several ways of dealing with matrices.

It would be nice if someone could point me in the right direction.

Recommended answer

Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency (*):

import scipy.sparse

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))
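
The same (data, (row, col)) pattern can also be applied to the nested dictionary from the question. The following is only a sketch under assumptions: the helper name dict_to_sparse, the two index dictionaries, and the toy pair counts are invented for illustration and are not part of the original answer.

```python
from collections import defaultdict
from scipy import sparse


def dict_to_sparse(d):
    """Convert a nested dict d[head][tran] -> count into a COO matrix.

    Hypothetical helper: also returns the head -> row and tran -> column
    mappings so the matrix entries can be traced back to words.
    """
    row_index = {}  # head -> row number
    col_index = {}  # tran -> column number
    data, row, col = [], [], []
    for head, counts in d.items():
        i = row_index.setdefault(head, len(row_index))
        for tran, count in counts.items():
            j = col_index.setdefault(tran, len(col_index))
            data.append(count)
            row.append(i)
            col.append(j)
    shape = (len(row_index), len(col_index))
    M = sparse.coo_matrix((data, (row, col)), shape=shape)
    return M, row_index, col_index


# Toy pair counts (made-up data for illustration)
d = defaultdict(lambda: defaultdict(int))
for head, tran in [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]:
    d[head][tran] += 1

M, row_index, col_index = dict_to_sparse(d)
dense = M.toarray()  # small enough here to inspect as a dense array
```

COO is convenient for construction; converting with M.tocsr() before doing arithmetic is usually the better choice for matrix products.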

Now, to get the co-occurrence matrix:

A.T * A

(Ignore the diagonal, which holds the co-occurrences of terms with themselves, i.e. the sum of each term's squared per-document frequencies.)
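
As a minimal end-to-end sketch of the steps above, assuming three made-up tokenized documents: it builds A, computes the product with the @ operator (equivalent to * for SciPy sparse matrices), and zeroes the diagonal with setdiag. The sample data and the detour through LIL format are illustrative choices, not part of the original answer.

```python
import scipy.sparse

# Made-up, already tokenized documents (illustration only)
documents = [["a", "b"], ["b", "c"], ["a", "b", "c"]]

vocabulary = {}  # map terms to column indices
data, row, col = [], [], []
for i, doc in enumerate(documents):
    for term in doc:
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)
        row.append(i)
        col.append(j)

# Duplicate (row, col) entries are summed when COO is converted to CSR
A = scipy.sparse.coo_matrix((data, (row, col))).tocsr()

# Term-by-term co-occurrence counts
C = (A.T @ A).tolil()  # LIL allows cheap changes to the sparsity structure
C.setdiag(0)           # drop self co-occurrences
C = C.tocsr()

cooc = C.toarray()
```

For this toy input, "a" and "b" share two documents, "b" and "c" share two, and "a" and "c" share one, which is what the off-diagonal entries of cooc contain.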

Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
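
For the scikit-learn route, one possible sketch uses CountVectorizer to build the document-term matrix. It assumes scikit-learn is installed and that the documents are already tokenized (hence the pass-through analyzer); the sample documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already tokenized documents (illustration only)
documents = [["a", "b"], ["b", "c"], ["a", "b", "c"]]

# A callable analyzer receives each raw document and returns its tokens;
# here the documents are token lists already, so they pass through as-is.
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
A = vectorizer.fit_transform(documents)  # sparse document-term matrix

C = A.T @ A                     # term-by-term co-occurrence counts
vocab = vectorizer.vocabulary_  # term -> column index
```

The vocabulary_ attribute plays the same role as the hand-built vocabulary dictionary: it maps each term to its column in A, and therefore to its row and column in C.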
