Count verbs, nouns, and other parts of speech with Python's NLTK

Problem description

I have multiple texts and I would like to create profiles of them based on their usage of various parts of speech, like nouns and verbs. Basically, I need to count how many times each part of speech is used.

I have tagged the text but am not sure how to go further:

tokens = nltk.word_tokenize(text.lower())
text = nltk.Text(tokens)
tags = nltk.pos_tag(text)

How can I save the counts for each part of speech into a variable?

Solution

The pos_tag function returns a list of (token, tag) pairs:

tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')] 

If you are using Python 2.7 or later, collections.Counter does the counting for you:

>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
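
Since the question asks about nouns and verbs as groups, it may help to collapse the fine-grained tags. The tags above look like Penn Treebank tags, where noun tags start with NN (NN, NNS, NNP, NNPS) and verb tags start with VB (VB, VBD, VBG, VBN, VBP, VBZ). A minimal sketch under that assumption follows; the name coarse_counts, the COARSE_PREFIXES table and the category labels are made up for illustration and are not part of NLTK:

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to coarse labels.
COARSE_PREFIXES = (
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
)


def coarse_counts(tagged):
    """Count (token, tag) pairs by coarse part of speech."""
    counts = Counter()
    for _token, tag in tagged:
        for prefix, label in COARSE_PREFIXES:
            if tag.startswith(prefix):
                counts[label] += 1
                break
        else:
            # No prefix matched: determiners, prepositions, punctuation, ...
            counts["other"] += 1
    return counts


# Hand-written sample in the same shape as the pos_tag output.
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VBZ'), ('the', 'DT'), ('cats', 'NNS')]
noun_verb = coarse_counts(tagged)
print(noun_verb["noun"], noun_verb["verb"], noun_verb["other"])
```

Each total is then available as a plain variable, for example nouns = noun_verb["noun"]. If your tagger uses a different tag set, the prefix table would need to change accordingly.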

To normalize the counts (giving you the proportion of each tag), divide each count by the total:

>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag, count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
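
To build one profile per text, the counting and normalizing steps can be wrapped in a small function. This is only a sketch, assuming Python 3 (where / is true division, so float() is not needed); the function name pos_proportions is hypothetical, and the input is expected to be a list of (token, tag) pairs like the one shown earlier:

```python
from collections import Counter


def pos_proportions(tagged):
    """Return {tag: share of tokens} for a list of (token, tag) pairs."""
    counts = Counter(tag for _token, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        # Empty text: avoid dividing by zero.
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Hand-written sample in the same shape as the pos_tag output.
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')]
profile = pos_proportions(tagged)
print(profile)
```

Calling the function once per tagged text gives dictionaries that can be compared tag by tag; a tag that never occurs in a text is simply absent, so profile.get(tag, 0.0) is a safe way to read it.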

Note that in versions of Python older than 2.7, collections.Counter is not available, so you'll have to do the counting yourself, for example with a defaultdict:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...     counts[tag] += 1
...
>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})
