How do I calculate a word-word co-occurrence matrix with sklearn?
Question
I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix. I can get the document-term matrix, but I am not sure how to obtain a word-word matrix of co-occurrences.
Answer
Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))  # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1  # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X)  # this is the co-occurrence matrix in sparse CSR format
Xc.setdiag(0)   # sometimes you want to fill same-word co-occurrence with 0
print(Xc.todense())  # print the matrix in dense format
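Run end to end on the three example documents, the snippet above can be checked like this (the matrix values below are re-derived here as a sanity check, not copied from the original post; rows and columns follow CountVectorizer's alphabetical vocabulary order: book, cat, good, shit, this):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)

Xc = X.T @ X    # word-word co-occurrence, sparse
Xc.setdiag(0)   # zero out the same-word diagonal
print(Xc.todense())
# rows/cols in vocabulary order: book, cat, good, shit, this
```

Note the book-this entry is 3, not 1, because "this" appears three times in the first document; the binarization trick further down removes that inflation.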
You can also refer to count_model and count_model.vocabulary_ for the mapping from words to matrix indices.
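To make that mapping concrete, here is a small sketch (using the same example documents) that prints which word corresponds to which row/column of the co-occurrence matrix; CountVectorizer assigns indices in alphabetical order of the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)

# vocabulary_ maps each word to its column in X, which is also
# its row/column index in the co-occurrence matrix X.T * X
for word, idx in sorted(count_model.vocabulary_.items(), key=lambda kv: kv[1]):
    print(idx, word)
```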
Or, if you want to normalize by the diagonal component (referring to the answer in the previous post):
import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1. / Xc.diagonal())
Xc_norm = g * Xc  # normalized co-occurrence matrix
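As a standalone check of this normalization step (re-assembled here with the same example documents): each row of Xc is scaled by the inverse of its diagonal entry, so every diagonal element of the normalized matrix comes out as 1.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
X = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)

Xc = X.T @ X                      # raw co-occurrence; diagonal holds per-word squared-count sums
g = sp.diags(1. / Xc.diagonal())  # inverse of the diagonal
Xc_norm = g @ Xc                  # each row divided by its diagonal entry
print(Xc_norm.todense())
```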
One extra note on @Federico Caccia's answer: if you don't want co-occurrences that are spurious repeats within the same text, set every occurrence count greater than 1 to 1, e.g.
X[X > 0] = 1  # do this line first, before computing the co-occurrence
Xc = (X.T * X)
...
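To see the effect of the binarization, here is a small side-by-side sketch (values recomputed here, not from the original post): in 'this this this book', the words book and this co-occur in one document, but the raw product counts the pair three times because "this" appears three times.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
X = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)

Xc_raw = (X.T @ X).todense()  # counts inflated by within-document repeats
X[X > 0] = 1                  # binarize: each word counts at most once per document
Xc_bin = (X.T @ X).todense()

# vocabulary order is alphabetical: book=0, cat=1, good=2, shit=3, this=4
print(Xc_raw[0, 4], Xc_bin[0, 4])  # book-this pair: 3 vs 1
```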