Extracting most frequent words out of a corpus with Python
Maybe this is a stupid question, but I have a problem extracting the ten most frequent words from a corpus with Python. This is what I've got so far. (By the way, I use NLTK to read a corpus with two subcategories, each containing 10 .txt files.)
import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')
from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered]
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting
If I print the result of this function for my corpus, I get a list of all the words, each with a '1' behind it: every count is one. I know that, for example, the word 'baby' occurs five or six times in my corpus, and it still gives 'baby: 1'. So it doesn't work the way I want.
Can someone help me?
If you're using NLTK anyway, try FreqDist(samples) to first generate a frequency distribution from the given sample. Then call the most_common(n) method to get the n most common words in the sample, sorted by descending frequency. Something like:
from nltk.probability import FreqDist
# Count the corpus tokens, not the stopword list; mycorpus is the reader from the question
fdist = FreqDist(mycorpus.words())
top_ten = fdist.most_common(10)
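As for why every count in the question's function comes out as 1: the words are passed through set() before they are counted, and a set keeps only one copy of each word, so the loop sees each word exactly once. A minimal sketch of one way around this, assuming Python 3 and using collections.Counter from the standard library instead of a hand-rolled dictionary, might look like the following. The name top_words and the sample list are hypothetical and used only for illustration.

```python
import string
from collections import Counter


def top_words(words, stoplist, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    stop = set(stoplist)
    # Translation table that deletes every ASCII punctuation character.
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for word in words:
        # Lowercase and strip punctuation, but do NOT deduplicate with set().
        token = word.lower().translate(strip_punct)
        # Skip tokens that were pure punctuation, and skip stopwords.
        if token and token not in stop:
            counts[token] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for mycorpus.words().
sample = ["Baby", "baby", ",", "de", "baby", "huis", "Huis", "."]
print(top_words(sample, stoplist=["de"], n=2))  # [('baby', 3), ('huis', 2)]
```

With NLTK, the same function could presumably be called as top_words(mycorpus.words(), stopwords.words('dutch')). Note also that the original sorted(...) call passes itemgetter itself as the key. Sorting by count would need itemgetter(1), and iteritems() exists only in Python 2.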