Extracting most frequent words out of a corpus with Python

Problem description

Maybe this is a stupid question, but I have a problem with extracting the ten most frequent words out of a corpus with Python. This is what I've got so far. (By the way, I work with NLTK for reading a corpus with two subcategories, each containing 10 .txt files.)

import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')

from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered] 
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting 

If I print the result of this function for my corpus, it gives me a list of all the words with '1' behind them. It gives me a dictionary, but all my values are one. And I know that, for example, the word 'baby' occurs five or six times in my corpus... and still it gives 'baby: 1'... So it doesn't work the way I want...
Can someone help me?

Solution

If you're using NLTK anyway, try the FreqDist(samples) function to first generate a frequency distribution from the given sample. Then call the most_common(n) method to find the n most common words in the sample, sorted by descending frequency. Something like:

from nltk.probability import FreqDist
fdist = FreqDist(no_punct)  # build the distribution from the cleaned corpus words, not the stopword list
top_ten = fdist.most_common(10)
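
As a side note, the reason every count comes out as 1 in the original function is the set() built from the lowercased words: it removes all duplicate tokens before anything is counted. Below is a minimal sketch of the whole pipeline with FreqDist doing the counting. It assumes NLTK 3, that mycorpus is the corpus reader from the question, and the helper name toptenwords_freqdist is only for illustration.

import string

from nltk.corpus import stopwords
from nltk.probability import FreqDist

def toptenwords_freqdist(mycorpus, n=10):
    """Return the n most frequent words in an NLTK corpus reader."""
    stoplist = set(stopwords.words('dutch'))
    # Python 3 replacement for s.translate(None, string.punctuation)
    strip_punct = str.maketrans('', '', string.punctuation)

    cleaned = []
    for word in mycorpus.words():              # keep duplicates: no set() here
        word = word.lower().translate(strip_punct)
        if word and word not in stoplist:
            cleaned.append(word)

    # FreqDist counts token frequencies; most_common(n) returns
    # (word, count) pairs sorted by descending count
    return FreqDist(cleaned).most_common(n)

Called as toptenwords_freqdist(mycorpus), it should return (word, count) pairs with the real frequencies rather than a count of 1 for every word.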

