Compute tf-idf with a corpus

Problem description

I copied some source code that shows how to build a system that computes tf-idf. Here is the code:

    #module import
    from __future__ import division, unicode_literals
    import math
    import string
    import re
    import os

    # current TextBlob releases are imported from textblob, not text.blob
    from textblob import TextBlob as tb
    #create a new array
    words = {} 
    def tf(word, blob):
       return blob.words.count(word) / len(blob.words)

    def n_containing(word, bloblist):
       # test against the word list; "word in blob" is a substring match on the raw text
       return sum(1 for blob in bloblist if word in blob.words)

    def idf(word, bloblist):
       return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

    def tfidf(word, blob, bloblist):
       return tf(word, blob) * idf(word, bloblist)

    regex = re.compile('[%s]' % re.escape(string.punctuation))

    f = open('D:/article/sport/a.txt','r')
    var = f.read()
    var = regex.sub(' ', var)
    var = var.lower()

    document1 = tb(var)

    f = open('D:/article/food/b.txt','r')
    var = f.read()
    var = var.lower()
    document2 = tb(var)


    bloblist = [document1, document2]
    for i, blob in enumerate(bloblist):
       print("Top words in document {}".format(i + 1))
       scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
       sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
       for word, score in sorted_words[:50]:
          print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

The problem is that I want to put all of the files in the sport folder into one corpus, and the food articles in the food folder into another corpus, so that the system gives a result for each corpus. Right now I can only compare individual files, but I want to compare corpora. Sorry for asking this question; any help will be appreciated.

Thanks

Recommended answer

As I understand it, you want to calculate the word frequencies of two files and store them in separate files so that you can compare them. You can do this from the terminal. Here is some simple code to calculate word frequency:

import string
import operator

keywords = []


def removePunctuation(sentence):
    """Lower-case the text and drop every punctuation character."""
    sentence = sentence.lower()
    new_sentence = ""
    for char in sentence:
        if char not in string.punctuation:
            new_sentence = new_sentence + char
    return new_sentence


def wordFrequences(sentence):
    """Print (word, count) pairs, most frequent first, then the words alone."""
    wordFreq = {}
    # Use the argument rather than the global new_sentence.
    split_sentence = sentence.split()
    for word in split_sentence:
        wordFreq[word] = wordFreq.get(word, 0) + 1
    # od = collections.OrderedDict(sorted(wordFreq.items(), reverse=True))
    # print(od)
    # items() and print() work on both Python 2 and Python 3.
    sorted_x = sorted(wordFreq.items(), key=operator.itemgetter(1), reverse=True)
    print(sorted_x)
    for key, value in sorted_x:
        keywords.append(key)
    print(keywords)


f = open('D:/article/sport/a.txt', 'r')
sentence = f.read()
f.close()
# sentence = "The first test of the function some some some some"
new_sentence = removePunctuation(sentence)
wordFrequences(new_sentence)

You have to run this code twice, changing the path of the text file each time, and redirect the output to a file when you run it from the console, like this:

python abovecode.py > destinationfile.txt

In your case, for example:

python abovecode.py > sportfolder/file1.txt
python abovecode.py > foodfolder/file2.txt
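If the goal is one frequency table per folder rather than per file, the two manual runs could be replaced by a loop over each folder. The following is only a sketch, written for Python 3: `corpus_word_counts` is a hypothetical helper that is not part of the script above, and it assumes every article is a UTF-8 `.txt` file sitting directly inside the folder.

```python
import glob
import os
import string
from collections import Counter


def corpus_word_counts(folder):
    """Count words across every .txt file in `folder` (hypothetical helper)."""
    counts = Counter()
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # Assumption: the articles are UTF-8 encoded text files.
        with open(path, encoding="utf-8") as handle:
            text = handle.read().lower()
        # Replace punctuation with spaces, then split on whitespace.
        for char in string.punctuation:
            text = text.replace(char, " ")
        counts.update(text.split())
    return counts


if __name__ == "__main__":
    # Folder names follow the paths used in the question.
    for folder in ("D:/article/sport", "D:/article/food"):
        print(folder, corpus_word_counts(folder).most_common(10))
```

Each folder then yields a single `Counter`, so the sport and food corpora can be compared directly without redirecting output by hand.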

Important: if you want the words together with their frequencies, omit this line from abovecode.py:

print(keywords)

Important: if you only need the words ordered by frequency, omit this line from abovecode.py:

print(sorted_x)
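To compare corpora rather than single files, as the question asks, one option is to treat each corpus as a single large document and apply the question's tf-idf definitions to plain word lists. This is a sketch under that assumption, written for Python 3, and not the asker's or the answerer's method; `tfidf_per_corpus`, the corpus names, and the sample sentences are made up for illustration. Note that the question's `idf` adds 1 to the denominator, so with only two corpora a word found in one corpus gets log(2/2) = 0 and a word found in both gets log(2/3) < 0. The sketch therefore swaps in the unsmoothed log(N / n) variant, where a word found in only one of two corpora gets log(2) and a word found in both gets 0.

```python
import math
from collections import Counter


def tfidf_per_corpus(corpora):
    """Map corpus name -> {word: tf-idf}, treating each corpus as one document.

    `corpora` maps a corpus name to its list of lower-cased words.
    """
    counts = {name: Counter(words) for name, words in corpora.items()}
    # Document frequency: in how many corpora each word occurs.
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())
    n_corpora = len(corpora)
    scores = {}
    for name, counter in counts.items():
        total = sum(counter.values())
        scores[name] = {
            # Unsmoothed idf, log(N / n): an assumption, unlike the
            # question's log(N / (1 + n)).
            word: (count / total) * math.log(n_corpora / doc_freq[word])
            for word, count in counter.items()
        }
    return scores


# Made-up sample data standing in for the sport and food folders.
corpora = {
    "sport": "the team won the match".split(),
    "food": "the soup was hot".split(),
}
for name, word_scores in tfidf_per_corpus(corpora).items():
    top = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)[:3]
    print(name, [(word, round(score, 3)) for word, score in top])
```

With only two corpora the idf term takes just two values, so the ranking inside each corpus is driven mostly by term frequency among the words unique to it; adding more corpora would make the scores more informative.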
