Extracting most frequent words out of a corpus with Python

Problem description

Maybe this is a stupid question, but I have a problem with extracting the ten most frequent words out of a corpus with Python. This is what I've got so far. (By the way, I work with NLTK for reading a corpus with two subcategories, each containing 10 .txt files.)

import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')

from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered] 
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting 

If I print the result of this function for my corpus, it gives me a list of all the words with '1' behind them. It gives me a dictionary, but all my values are one. And I know that, for example, the word 'baby' occurs five or six times in my corpus... and still it gives 'baby: 1'... So it doesn't work the way I want...
Can someone help me?

Solution

If you're using NLTK anyway, try the FreqDist(samples) function to first generate a frequency distribution from the given sample. Then call the most_common(n) method to find the n most common words in the sample, sorted by descending frequency. Something like:

from nltk.probability import FreqDist
fdist = FreqDist(no_punct)  # build the distribution from the cleaned corpus words, not the stopword list
top_ten = fdist.most_common(10)
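
As a side note, the reason every count comes out as 1 in the original function is the set() built from the lowercased words: it removes all duplicate tokens before anything is counted. Below is a minimal sketch of the whole pipeline with FreqDist doing the counting. It assumes NLTK 3, that mycorpus is the corpus reader from the question, and the helper name toptenwords_freqdist is only for illustration.

import string

from nltk.corpus import stopwords
from nltk.probability import FreqDist

def toptenwords_freqdist(mycorpus, n=10):
    """Return the n most frequent words in an NLTK corpus reader."""
    stoplist = set(stopwords.words('dutch'))
    # Python 3 replacement for s.translate(None, string.punctuation)
    strip_punct = str.maketrans('', '', string.punctuation)

    cleaned = []
    for word in mycorpus.words():              # keep duplicates: no set() here
        word = word.lower().translate(strip_punct)
        if word and word not in stoplist:
            cleaned.append(word)

    # FreqDist counts token frequencies; most_common(n) returns
    # (word, count) pairs sorted by descending count
    return FreqDist(cleaned).most_common(n)

Called as toptenwords_freqdist(mycorpus), it should return (word, count) pairs with the real frequencies rather than a count of 1 for every word.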

