How do I calculate a word-word co-occurrence matrix with sklearn?

Views: 351 Published: 2020/5/7 18:38:36 Tags: python matrix scikit-learn

Question

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.

Recommended answer

Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can get the word-word co-occurrence matrix simply by multiplying the transposed document-term matrix by the document-term matrix itself.

from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs) # sparse document-term matrix (documents x words)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T @ X) # co-occurrence matrix (words x words) in sparse csr format; @ is matrix multiplication
Xc.setdiag(0) # sometimes you want to set the co-occurrence of a word with itself to 0
print(Xc.todense()) # print out matrix in dense format
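
Why the product gives co-occurrence counts can be written out as a short derivation. The symbols $X$ (document-term matrix), $C$ (co-occurrence matrix), $d$ (document index) and $i$, $j$ (word indices) are notation introduced here for illustration, not taken from the original answer:

```latex
C = X^{\top} X, \qquad C_{ij} = \sum_{d} X_{di}\, X_{dj}
```

Each document contributes the product of the two words' counts in that document. In the sample corpus, "book" and "this" appear together only in the first document, with counts 1 and 3, so their entry is $1 \times 3 = 3$; if the counts are first clipped to 1 (the commented-out line), the entry becomes 1. The diagonal entry $C_{ii} = \sum_{d} X_{di}^{2}$ pairs a word with itself, which is why it is often zeroed.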

You can also look up which word each row and column corresponds to in count_model:

count_model.vocabulary_
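
As a minimal sketch of how that mapping might be used, the snippet below looks up the co-occurrence count for a pair of words. The helper name `cooccurrence` and the slightly altered sample corpus are assumptions made for illustration; only `CountVectorizer`, `fit_transform` and `vocabulary_` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, close to the one in the answer
docs = ['this this this book',
        'this cat good',
        'cat good dog']

count_model = CountVectorizer()
X = count_model.fit_transform(docs)   # documents x words
Xc = X.T @ X                          # words x words


def cooccurrence(word_a, word_b):
    """Return the co-occurrence count of two words (hypothetical helper)."""
    i = count_model.vocabulary_[word_a]   # column index of word_a
    j = count_model.vocabulary_[word_b]   # column index of word_b
    return int(Xc[i, j])


print(cooccurrence('this', 'book'))
print(cooccurrence('cat', 'good'))
```

Because `vocabulary_` is a plain dict from word to column index, a word that never appeared in the corpus raises a `KeyError` here; whether to catch that or return 0 is a design choice left open.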

Or, if you want to normalize by the diagonal component (as in the answer in the previous post):

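import scipy.sparse as sp
Xc = (X.T @ X) # co-occurrence matrix, diagonal left intact
g = sp.diags(1./Xc.diagonal()) # diagonal matrix holding 1 / (self co-occurrence of each word)
Xc_norm = g @ Xc # normalized co-occurrence matrix: row i is divided by Xc[i, i]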
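
One thing worth seeing on a small case is that this normalization scales each row by its own diagonal entry, so the result is generally not symmetric. The sketch below uses a hypothetical corpus (an assumption, not the answer's data) and compares the two directions of the same word pair.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, close to the one in the answer
docs = ['this this this book',
        'this cat good',
        'cat good dog']

count_model = CountVectorizer()
X = count_model.fit_transform(docs)

Xc = X.T @ X                              # raw co-occurrence counts
g = sp.diags(1.0 / Xc.diagonal())         # 1 / diagonal, one entry per word
Xc_norm = (g @ Xc).toarray()              # row i divided by Xc[i, i]

vocab = count_model.vocabulary_
this_i, book_i = vocab['this'], vocab['book']

print(Xc_norm[this_i, book_i])   # row "this": 3 / 10
print(Xc_norm[book_i, this_i])   # row "book": 3 / 1
```

With raw counts the normalized values can exceed 1, as the second line shows. If the counts are clipped to 1 first, the diagonal holds the number of documents containing each word, and every normalized entry falls between 0 and 1.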

As an extra note following @Federico Caccia's answer: if you don't want spurious co-occurrences that come from a word repeating within a single text, set every count greater than 1 to 1, e.g.

X[X > 0] = 1 # do this first, before computing the co-occurrence matrix
Xc = (X.T @ X)
...
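
An alternative that should give the same result is to let `CountVectorizer` clip the counts itself through its `binary=True` parameter. This is a sketch on a hypothetical corpus rather than part of the original answer; it checks that both routes produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, close to the one in the answer
docs = ['this this this book',
        'this cat good',
        'cat good dog']

# Route 1: count normally, then clip every non-zero count to 1
X = CountVectorizer().fit_transform(docs)
X[X > 0] = 1
Xc_clipped = (X.T @ X).toarray()

# Route 2: ask CountVectorizer for binary (0/1) counts directly
bin_model = CountVectorizer(binary=True)
X_bin = bin_model.fit_transform(docs)
Xc_binary = (X_bin.T @ X_bin).toarray()

print(np.array_equal(Xc_clipped, Xc_binary))
```

With binary counts, each off-diagonal entry is the number of documents in which both words appear, and each diagonal entry is the number of documents containing that word.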
