nltk function to count occurrences of certain words

Question

The NLTK book contains the exercise: "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"

I thought I could use a function like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in the State of the Union corpus, and I feel there has to be an easier way to see the counts of these words for each text than repeating this call over and over.

Recommended answer

You can use the .words() function of the corpus to return a list of strings (i.e. tokens/words):

>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Then use a Counter() object to count the instances, see https://docs.python.org/2/library/collections.html#collections.Counter:

>>> from collections import Counter
>>> wordcounts = Counter(brown.words())

But note that Counter is case-sensitive, so 'The' and 'the' are counted separately unless you lowercase the tokens first:

>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
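To connect this back to the per-document counts the question asks for, here is a minimal sketch that applies the same lowercased Counter idea to each document separately. The helper names count_targets and count_per_document are hypothetical, and the two token lists are made-up stand-ins for corpus documents so that the example runs without any corpus data.

```python
from collections import Counter

# The three words named in the exercise.
TARGETS = ("men", "women", "people")


def count_targets(tokens, targets=TARGETS):
    """Return {target: count} for one document, ignoring case."""
    counts = Counter(token.lower() for token in tokens)
    # A Counter returns 0 for missing keys, so absent words are reported as 0.
    return {target: counts[target] for target in targets}


def count_per_document(documents, targets=TARGETS):
    """documents maps a document name to an iterable of its tokens."""
    return {name: count_targets(tokens, targets)
            for name, tokens in documents.items()}


# Made-up token lists standing in for corpus documents (not real data).
sample_documents = {
    "doc-a.txt": ["Men", "and", "women", "and", "men"],
    "doc-b.txt": ["The", "people", "and", "the", "People"],
}

per_document = count_per_document(sample_documents)
for name in sorted(per_document):
    print(name, per_document[name])
```

With NLTK, and assuming the state_union corpus has been downloaded, the documents mapping could be built as {fileid: state_union.words(fileid) for fileid in state_union.fileids()}. Since the file names in that corpus begin with the year, iterating over the sorted file names should give a rough view of how usage changes over time.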
