How to calculate a term-document matrix?


Problem description

I know that a term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

I am using sklearn's CountVectorizer to extract features from strings (text files) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content)  # X is a sparse matrix: rows = documents, columns = terms
print(X)

The output is as follows

I do not understand how this matrix has been calculated. Please discuss the example shown in the code. I have read one more example on Wikipedia but could not understand it.

Solution

The output of CountVectorizer().fit_transform() is a sparse matrix, which means it stores only the non-zero elements of the matrix. When you call print(X), only the non-zero entries are displayed, as you can see in the image.

As for how the calculation is done, you can have a look at the official documentation here.

In its default configuration, CountVectorizer lowercases and tokenizes the given documents or raw text (keeping only tokens that have 2 or more characters) and counts the word occurrences.

Basically, the steps are as follows:
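The tokenization rule can be illustrated with a minimal sketch in plain Python. It assumes the vectorizer's documented defaults (lowercasing, and the token pattern `(?u)\b\w\w+\b`); the `tokenize` helper is a hypothetical name used only for illustration and is not part of scikit-learn.

```python
import re

# Same pattern string as CountVectorizer's documented default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: lowercase the text, then extract the tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("How to format my hard disk"))
# ['how', 'to', 'format', 'my', 'hard', 'disk']

print(tokenize("I format a disk"))
# ['format', 'disk']  (the single-character tokens 'i' and 'a' are dropped)
```

This is why a one-letter word such as "a" never shows up as a column, unless a different token_pattern is passed to the vectorizer.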

  • Step 1 - Collect all distinct terms from all the documents passed to fit().

    For your data, they are [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']. This list is available from vectorizer.get_feature_names() (in scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead; get_feature_names() was removed in 1.2).

  • Step 2 - In transform(), count how many times each term learned in fit() occurs in each document, and output the counts in the term-frequency matrix.

    In your case, you are supplying both documents to transform() (fit_transform() is a shorthand for fit() and then transform()). So, the result is

    [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']

    First   1  1  1  1  1  0  1

    Second  1  1  1  0  0  1  0

You can get the above result by calling X.toarray().
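The two steps can be sketched in plain Python to make the counting explicit. This is an illustration of the idea under the default tokenization, not scikit-learn's actual implementation; `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helper names.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # tokens of 2+ word characters


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Step 1: collect the distinct terms, sorted, and give each a column index."""
    terms = sorted({token for doc in documents for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def transform(documents, vocabulary):
    """Step 2: for each document, count the occurrences of every known term."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, index in vocabulary.items():
            row[index] = counts[term]  # Counter returns 0 for absent terms
        matrix.append(row)
    return matrix


content = ["how to format my hard disk", "hard disk format problems"]
vocabulary = fit_vocabulary(content)
matrix = transform(content, vocabulary)

print(sorted(vocabulary, key=vocabulary.get))
# ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
for row in matrix:
    print(row)
# [1, 1, 1, 1, 1, 0, 1]
# [1, 1, 1, 0, 0, 1, 0]
```

Each row is one document and each column is one term, in the same alphabetical order as the vocabulary.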

In the image of the print(X) output you posted, the first column gives the (row, column) index into the term-frequency matrix and the second column gives the frequency of that term.

<0,0> means first row, first column, i.e., the frequency of the term "disk" (first term in our tokens) in the first document = 1

<0,2> means first row, third column, i.e., the frequency of the term "hard" (third term in our tokens) in the first document = 1

<0,5> means first row, sixth column, i.e., the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your image.
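To read the stored entries programmatically, one option is to convert the sparse matrix to SciPy's COO format and label each (row, column) pair with its term. The sketch below assumes scikit-learn is installed; the fallback to get_feature_names() is only there for releases older than 1.0, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(content)

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# only provide get_feature_names(), so fall back to it when needed.
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
terms = [str(term) for term in get_names()]

# COO format exposes the stored (non-zero) entries as parallel arrays.
coo = X.tocoo()
entries = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
for row, col, count in entries:
    print(f"({row}, {col}) -> document {row}, term {terms[col]!r}, count {count}")

# The dense view contains the same counts plus the zeros that are not stored.
print(X.toarray())
```

Only the non-zero cells appear in the loop; a pair such as (0, 5) is absent because "problems" does not occur in the first document.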
