How to calculate term-document matrix?
I know that a term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (text files) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content)  # X is the term-document matrix
print(X)
The output is as follows
I am not getting how this matrix has been calculated. Please discuss the example shown in the code. I have read one more example on Wikipedia, but could not understand it.
The output of CountVectorizer().fit_transform() is a sparse matrix, which means it only stores the non-zero elements of the matrix. When you do print(X), only the non-zero entries are displayed, as you observe in the image.
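The sparse storage can be seen directly on the question's two documents — a small sketch reusing the code from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
X = CountVectorizer(min_df=1).fit_transform(content)

# Only the non-zero counts are physically stored: 10 entries
# instead of the full 2 x 7 = 14 cells of the dense matrix.
print(X.shape)  # (2, 7)
print(X.nnz)    # 10
```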
As for how the calculation is done, you can have a look at the official documentation here.
The CountVectorizer, in its default configuration, tokenizes the given documents or raw text (it keeps only tokens with 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all the distinct terms from all the documents passed to fit(). For your data, they are:
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This list is available from vectorizer.get_feature_names().
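Step 1 can be checked directly. Note that newer scikit-learn versions replaced get_feature_names() with get_feature_names_out(); the vocabulary_ attribute, used in this sketch, works in both old and new versions:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(content)

# vocabulary_ maps each term to its column index; sorting the terms
# by that index recovers the column order of the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)  # ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
```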
Step 2 - In transform(), count how many times each term collected during fit() occurs in each document, and output the counts as a term-frequency matrix. In your case, you are supplying both documents to transform() (fit_transform() is shorthand for fit() followed by transform()). So the result is:

         [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
First       1         1         1       1      1        0           1
Second      1         1         1       0      0        1           0
You can get the above result by calling X.toarray().
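A quick sketch verifying the table above with the question's own data:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
X = CountVectorizer(min_df=1).fit_transform(content)

# Dense view: one row per document, one column per vocabulary term
# (disk, format, hard, how, my, problems, to).
print(X.toarray())
# [[1 1 1 1 1 0 1]
#  [1 1 1 0 0 1 0]]
```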
In the image of the print(X) output you posted, the first column represents the (row, column) index into the term-frequency matrix and the second column represents the frequency of that term.
<0,0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1
<0,2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1
<0,5> means first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your image.
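The index explanation above can also be checked programmatically — a sketch that walks the stored entries by converting the matrix to COO format, which exposes the (row, col, value) triples directly:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(content).tocoo()  # COO format: parallel row/col/data arrays

# Recover the column order of the vocabulary, then name each stored entry.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for row, col, count in zip(X.row, X.col, X.data):
    print(f"({row}, {col})  doc {row}, term {terms[col]!r}: {count}")
```

The zero entry for "problems" in the first document, i.e. (0, 5), never appears in this loop, because a sparse matrix does not store it.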