Count verbs, nouns, and other parts of speech with Python's NLTK
Question
I have multiple texts and I would like to create profiles of them based on their usage of various parts of speech, like nouns and verbs. Basically, I need to count how many times each part of speech is used.
I have tagged the text but am not sure how to go further:
tokens = nltk.word_tokenize(text.lower())
text = nltk.Text(tokens)
tags = nltk.pos_tag(text)
How can I save the counts for each part of speech into a variable?
Recommended answer
The pos_tag method gives you back a list of (token, tag) pairs:
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')]
If you are using Python 2.7 or later, you can count the tags with collections.Counter:
>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
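The tags returned by the tagger are fine-grained: with Penn Treebank-style tags, for example, NN, NNS, NNP, and NNPS are all nouns, and VB, VBD, VBG, VBN, VBP, and VBZ are all verbs. If the profile should count "nouns" and "verbs" as such, the tags can be collapsed into broader classes before counting. The following is a minimal sketch under that assumption; the `COARSE_PREFIXES` table and the `coarse_tag` and `count_coarse` helpers are illustrative names, not part of NLTK.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to broad classes.
# Assumption: the tags follow the Penn Treebank convention.
COARSE_PREFIXES = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]

def coarse_tag(tag):
    """Map a fine-grained tag to a broad class, or 'other' if none matches."""
    for prefix, label in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return label
    return "other"

def count_coarse(tagged):
    """Count broad part-of-speech classes in a list of (token, tag) pairs."""
    return Counter(coarse_tag(tag) for _, tag in tagged)

# Hand-written sample input in the same shape that pos_tag returns.
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VBZ'),
          ('the', 'DT'), ('cats', 'NNS')]
coarse_counts = count_coarse(tagged)
print(coarse_counts["noun"], coarse_counts["verb"], coarse_counts["other"])
```

Matching on prefixes keeps the table short, at the cost of lumping everything else (determiners, prepositions, punctuation) into "other"; extend the table if those classes matter for the profile.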
To normalize the counts (giving you the proportion of each tag), do:
>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag, count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
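Since the goal is to profile several texts, the counting and normalizing steps can be wrapped in one function and applied to each text in turn. This is only a sketch: `pos_profile`, `tagged_texts`, and the sample data are hypothetical, and in practice each list would come from calling pos_tag on a tokenized text.

```python
from collections import Counter

def pos_profile(tagged):
    """Return {tag: proportion} for a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty text: there are no proportions to report
    # float() forces true division on Python 2 as well as Python 3.
    return {tag: float(count) / total for tag, count in counts.items()}

# Illustrative, hand-tagged inputs keyed by a made-up text name.
tagged_texts = {
    "text_a": [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'),
               ('the', 'DT'), ('cat', 'NN')],
    "text_b": [('dogs', 'NNS'), ('bark', 'VBP')],
}

# One profile per text, each a mapping from tag to its share of the tokens.
profiles = {name: pos_profile(tagged) for name, tagged in tagged_texts.items()}
```

Because each profile holds proportions and not raw counts, texts of different lengths can be compared directly; a tag missing from a profile simply means it never occurred in that text.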
Note that in versions of Python older than 2.7, Counter is not available, so you'll have to do the counting yourself, for example with a defaultdict:
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...     counts[tag] += 1
...
>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})