Compute tf-idf with corpus
Question
I have copied some source code showing how to build a system that computes tf-idf; here is the code:
# module imports
from __future__ import division, unicode_literals
import math
import re
import string

from textblob import TextBlob as tb  # modern import path; very old releases used text.blob

def tf(word, blob):
    # term frequency: share of the blob's tokens that equal `word`
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # number of documents that contain `word`
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # inverse document frequency, with +1 smoothing in the denominator
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

# strip punctuation before tokenising
regex = re.compile('[%s]' % re.escape(string.punctuation))

with open('D:/article/sport/a.txt', 'r') as f:
    var = regex.sub(' ', f.read()).lower()
document1 = tb(var)

with open('D:/article/food/b.txt', 'r') as f:
    var = regex.sub(' ', f.read()).lower()
document2 = tb(var)

bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:50]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
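As a quick sanity check of the tf/idf formulas in the code above, here is a dependency-free version that uses plain token lists instead of TextBlob (a sketch; the example documents are made up):

```python
import math

def tf(word, doc):
    # term frequency: share of tokens in `doc` that equal `word`
    return doc.count(word) / len(doc)

def n_containing(word, docs):
    # how many documents contain `word` at least once
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    # same +1 smoothing as the code above
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["ball", "kicked"], ["soup", "hot"], ["rain", "cold"]]
print(tfidf("ball", docs[0], docs))   # positive: "ball" is rare across docs

docs2 = [["the", "ball"], ["the", "soup"], ["the", "rain"]]
print(tfidf("the", docs2[0], docs2))  # negative: the +1 smoothing penalises ubiquitous words
```

Note that with this smoothing a word appearing in every document gets a negative score, which is why common words sink to the bottom of the ranking.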
But the problem is: I want to put all of the files in the sport folder into one corpus, and the food articles in the food folder into another corpus, so that the system gives a result for each corpus. Right now I can only compare individual files, but I want to compare between corpora. I am very sorry for asking this question; any help will be appreciated.
Thanks
Answer
What I gather is that you want to calculate the word frequencies of two files and store them in separate files for comparison; for that, you can use the terminal. Here is simple code to calculate word frequency:
import operator
import string

keywords = []

def removePunctuation(sentence):
    # lower-case the text and drop every punctuation character
    sentence = sentence.lower()
    new_sentence = ""
    for char in sentence:
        if char not in string.punctuation:
            new_sentence = new_sentence + char
    return new_sentence

def wordFrequences(sentence):
    # count how often each word appears in `sentence`
    wordFreq = {}
    for word in sentence.split():
        wordFreq[word] = wordFreq.get(word, 0) + 1
    # sort (word, count) pairs by count, highest first
    sorted_x = sorted(wordFreq.items(), key=operator.itemgetter(1), reverse=True)
    print(sorted_x)
    for key, value in sorted_x:
        keywords.append(key)
    print(keywords)

with open('D:/article/sport/a.txt', 'r') as f:
    sentence = f.read()
# sentence = "The first test of the function some some some some"
new_sentence = removePunctuation(sentence)
wordFrequences(new_sentence)
You have to run this code twice, changing the path of your text file each time, and each time you run it from the console, redirect the output like this:
python abovecode.py > destinationfile.txt
In your case:
python abovecode.py > sportfolder/file1.txt
python abovecode.py > foodfolder/file2.txt
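Instead of shell redirection, the script itself could write each output file; the helper below is a sketch, and the destination paths mirror the commands above (they are assumptions about your folder layout):

```python
import os

def save_frequencies(sorted_pairs, dest_path):
    # sorted_pairs is a (word, count) list like the sorted_x built above
    dirname = os.path.dirname(dest_path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    with open(dest_path, 'w') as out:
        for word, count in sorted_pairs:
            out.write('{} {}\n'.format(word, count))

# e.g. save_frequencies(sorted_x, 'sportfolder/file1.txt')
```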
Important: if you want the words together with their frequencies, omit the line

print(keywords)
Important: if you only need the words ordered by their frequency, omit the line

print(sorted_x)
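To get the per-corpus comparison the question actually asks for, one approach is to concatenate every file in a folder into a single token list, so each folder becomes one "document" in the tf-idf computation. A minimal dependency-free sketch (the .txt filter and the top_words helper are illustrative assumptions):

```python
import math
import os
import re
import string

punct = re.compile('[%s]' % re.escape(string.punctuation))

def load_corpus(folder):
    # read every .txt file in `folder` into one token list, so the whole
    # folder behaves as a single document
    tokens = []
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            with open(os.path.join(folder, name), 'r') as f:
                tokens.extend(punct.sub(' ', f.read()).lower().split())
    return tokens

def tfidf(word, doc, docs):
    tf = doc.count(word) / len(doc)
    idf = math.log(len(docs) / (1 + sum(1 for d in docs if word in d)))
    return tf * idf

def top_words(folders, n=50):
    # one entry in `corpora` per folder; tf-idf then compares corpus vs corpus
    corpora = [load_corpus(folder) for folder in folders]
    results = []
    for corpus in corpora:
        scores = {w: tfidf(w, corpus, corpora) for w in set(corpus)}
        results.append(sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n])
    return results

# e.g. top_words(['D:/article/sport', 'D:/article/food'])
```

With only two corpora, note that the +1 smoothing makes words unique to one corpus score exactly zero and shared words score negative, so you may want more than two corpora (or a different idf smoothing) for sharper rankings.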