How to calculate term-document matrix?
I know that a term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (text files) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content)  # X is the term-document matrix
print(X)
The output is as follows
I am not getting how this matrix has been calculated. Please discuss the example shown in the code. I have read one more example on Wikipedia, but could not understand it.
The output of CountVectorizer().fit_transform() is a sparse matrix, which means it only stores the non-zero elements of the matrix. When you do print(X), only the non-zero entries are displayed, as you observe in the image.
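The sparse storage can be seen directly on the question's two documents — a small sketch reusing the code from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
X = CountVectorizer(min_df=1).fit_transform(content)

# Only the non-zero counts are physically stored: 10 entries
# instead of the full 2 x 7 = 14 cells of the dense matrix.
print(X.shape)  # (2, 7)
print(X.nnz)    # 10
```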
As for how the calculation is done, you can have a look at the official documentation here.
The CountVectorizer, in its default configuration, tokenizes the given documents or raw text (it keeps only tokens with 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all the distinct terms from all the documents passed to fit(). For your data, they are:
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This list is available from vectorizer.get_feature_names().
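Step 1 can be checked directly. Note that newer scikit-learn versions replaced get_feature_names() with get_feature_names_out(); the vocabulary_ attribute, used in this sketch, works in both old and new versions:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(content)

# vocabulary_ maps each term to its column index; sorting the terms
# by that index recovers the column order of the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)  # ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
```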
Step 2 - In transform(), count how many times each term collected during fit() occurs in each document, and output the counts as a term-frequency matrix. In your case, you are supplying both documents to transform() (fit_transform() is shorthand for fit() followed by transform()). So the result is:

         [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
First       1         1         1       1      1        0           1
Second      1         1         1       0      0        1           0
You can get the above result by calling X.toarray().
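A quick sketch verifying the table above with the question's own data:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
X = CountVectorizer(min_df=1).fit_transform(content)

# Dense view: one row per document, one column per vocabulary term
# (disk, format, hard, how, my, problems, to).
print(X.toarray())
# [[1 1 1 1 1 0 1]
#  [1 1 1 0 0 1 0]]
```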
In the image of the print(X) output you posted, the first column represents the (row, column) index into the term-frequency matrix and the second column represents the frequency of that term.
<0,0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1
<0,2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1
<0,5> means first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your image.
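The index explanation above can also be checked programmatically — a sketch that walks the stored entries by converting the matrix to COO format, which exposes the (row, col, value) triples directly:

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(content).tocoo()  # COO format: parallel row/col/data arrays

# Recover the column order of the vocabulary, then name each stored entry.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for row, col, count in zip(X.row, X.col, X.data):
    print(f"({row}, {col})  doc {row}, term {terms[col]!r}: {count}")
```

The zero entry for "problems" in the first document, i.e. (0, 5), never appears in this loop, because a sparse matrix does not store it.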