Calculating tf-idf among documents using python 2.7


Problem description


I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files.

From there on I would like to calculate the frequencies of each term in each document and their cosine similarity by using tf-idf.

For example: there are 50 different documents/text files, each consisting of 5000 words/strings. I would like to take the first word from the first document/text, compare its frequency against all 250,000 words in total, then do the same for the second word, and so on for all 50 documents/texts.

The expected output for each frequency will be from 0 to 1.

How am I able to do so? I have been referring to the sklearn package, but most of the examples there only involve a few strings in each comparison.

Solution

You really should show us your code and explain in more detail which part it is that you are having trouble with.

What you describe is not usually how it's done. What you usually do is vectorize documents, then compare the vectors, which yields the similarity between any two documents under this model. Since you are asking about NLTK, I will proceed on the assumption that you want this regular, traditional method.

Anyway, with a traditional word representation, cosine similarity between two words is meaningless -- either two words are identical, or they're not. But there are certainly other ways you could approach term similarity or document similarity.

Copying the code from https://stackoverflow.com/a/23796566/874188 so we have a baseline:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)   # one row per document, one column per word form
idf = vectorizer._tfidf.idf_           # learned IDF weight per feature (also exposed as vectorizer.idf_)
print dict(zip(vectorizer.get_feature_names(), idf))

There is nothing here which depends on the length of the input. The number of features in idf will be larger if you have longer documents (more distinct word forms), and there will be more vectors in the corpus if you have more documents, but the algorithm as such will not need to change at all to accommodate more or longer documents.
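The baseline above only prints the IDF weights. To get the document-to-document cosine similarities the question asks about, a minimal sketch of the next step (assuming the same corpus and scikit-learn's cosine_similarity helper) could look like this:

from sklearn.metrics.pairwise import cosine_similarity

# X is the TF-IDF matrix from the snippet above.
# Pairwise cosine similarity between every pair of documents in the corpus;
# entry [i, j] is the similarity of document i to document j, between 0 and 1.
similarities = cosine_similarity(X)
print(similarities)

Because TfidfVectorizer L2-normalizes its rows by default, X * X.T would give the same numbers; either way, the similarity matrix simply grows to 50 x 50 for 50 documents without any change to the algorithm.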

If you don't want to understand why, you can stop reading here.

The vectors are basically an array of counts for each word form. The length of each vector is the number of word forms (i.e. the number of features). So if you have a lexicon with six entries like this:

0: a
1: aardvark
2: banana
3: fruit
4: flies
5: like

then the input document "a fruit flies like a banana" will yield a vector of six elements like this:

[2, 0, 1, 1, 1, 1]

because there are two occurrences of the word at index zero in the lexicon, zero occurrences of the word at index one, one occurrence of the word at index two, etc. This is a TF (term frequency) vector. It is already a useful vector; you can compare two of them using cosine distance and obtain a measurement of their similarity.
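To make the counting concrete, here is a toy sketch in plain Python (not how the library does it internally, just the bookkeeping described above) that builds the TF vector for that six-entry lexicon:

# Hypothetical lexicon from the example above: word form -> index
lexicon = {"a": 0, "aardvark": 1, "banana": 2, "fruit": 3, "flies": 4, "like": 5}

def tf_vector(document, lexicon):
    # Count how often each lexicon entry occurs in the document
    counts = [0] * len(lexicon)
    for word in document.lower().split():
        if word in lexicon:
            counts[lexicon[word]] += 1
    return counts

print(tf_vector("a fruit flies like a banana", lexicon))   # [2, 0, 1, 1, 1, 1]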

The purpose of the IDF factor is to normalize this. The normalization brings three benefits. Computationally, you don't need to do any per-document or per-comparison normalization, so it's faster. The algorithm also normalizes frequent words, so that many occurrences of "a" are properly regarded as insignificant if most documents contain many occurrences of this word (so you don't have to do explicit stop-word filtering), whereas many occurrences of "aardvark" are immediately and obviously significant in the normalized vector. Finally, the normalized output can be readily interpreted, whereas with plain TF vectors you have to take document length etc. into account to properly understand the result of the cosine similarity comparison.

So if the DF (document frequency) of "a" is 1000, and the DF of the other words in the lexicon is 1, the scaled vector will be

[0.002, 0, 1, 1, 1, 1]

(because we take the inverse of the document frequency, i.e. TF("a")*IDF("a") = TF("a")/DF("a") = 2/1000).
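Continuing the toy example, that scaling (a deliberate simplification; real implementations such as scikit-learn's use a smoothed logarithmic IDF rather than a plain 1/DF) would be:

tf = [2, 0, 1, 1, 1, 1]
df = [1000, 1, 1, 1, 1, 1]                      # hypothetical document frequencies from the text

scaled = [t / float(d) for t, d in zip(tf, df)]
print(scaled)                                    # [0.002, 0.0, 1.0, 1.0, 1.0, 1.0]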

The cosine similarity basically interprets these vectors in an n-dimensional space (here, n=6) and sees how far from each other their arrows are. Just for simplicity, let's scale this down to three dimensions, and plot the (IDF-scaled) number of "a" on the X axis, the number of "aardvark" occurrences on the Y axis, and the number of "banana" occurrences on the Z axis. The end point [0.002, 0, 1] differs from [0.003, 0, 1] by just a tiny bit, whereas [0, 1, 0] ends up at quite another corner of the cube we are imagining, so the cosine distance is large. (The normalization means 1.0 is the maximum of any element, so we are literally talking about a corner.)
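A quick numerical check of that three-dimensional picture, using NumPy and the three end points from the paragraph above:

import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([0.002, 0.0, 1.0])
b = np.array([0.003, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine(a, b))   # ~0.9999995 -- almost the same direction
print(cosine(a, c))   # 0.0        -- points at a completely different corner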

Now, returning to the lexicon, if you add a new document and it has words which are not already in the lexicon, they will be added to the lexicon, and so the vectors will need to be longer from now on. (Vectors you already created which are now too short can be trivially extended; the term weight for the hitherto unseen terms will obviously always be zero.) If you add the document to the corpus, there will be one more vector in the corpus to compare against. But the algorithm doesn't need to change; it will always create vectors with one element per lexicon entry, and you can continue to compare these vectors using the same methods as before.
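In scikit-learn terms, a sketch of this (reusing vectorizer, X and corpus from the baseline above; the new document text is just an illustration): transform keeps the fitted vocabulary fixed, so unseen words contribute nothing, while refitting on the enlarged corpus is what actually grows the lexicon.

new_doc = ["This aardvark is very strange"]    # "aardvark" was never seen during fitting

X_new = vectorizer.transform(new_doc)          # same width as X; "aardvark" is simply ignored
print(X_new.shape[1] == X.shape[1])            # True -- the vector length did not change

X_all = vectorizer.fit_transform(corpus + new_doc)
print(X_all.shape)                             # one more row, and one more column for "aardvark"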

You can of course loop over the terms and for each, synthesize a "document" consisting of just that single term. Comparing it to other single-term "documents" will yield 0.0 similarity to the others (or 1.0 similarity to a document containing the same term and nothing else), so that's not too useful, but a comparison against real-world documents will reveal essentially what proportion of each document consists of the term you are examining.
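A sketch of that per-term loop, again reusing vectorizer and X from the baseline above (the variable names are just for illustration):

from sklearn.metrics.pairwise import cosine_similarity

for term in vectorizer.get_feature_names():
    term_vec = vectorizer.transform([term])          # a "document" containing only this term
    sims = cosine_similarity(term_vec, X).ravel()    # similarity to every real document
    print("{}: {}".format(term, sims))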

The raw IDF vector tells you the relative frequency of each term. It usually expresses how many documents each term occurred in (so even if a term occurs more than once in a document, it only adds 1 to the DF for this term), though some implementations also allow you to use the bare term count.
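If what you want is the document frequencies themselves rather than TF-IDF weights, a sketch with CountVectorizer (counting each term at most once per document, as described above):

from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer(binary=True)      # 1 if the term occurs in a document, 0 otherwise
counts = counter.fit_transform(corpus)      # reusing the corpus from the baseline above

df = counts.sum(axis=0).A1                  # document frequency of each term
print(dict(zip(counter.get_feature_names(), df)))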
