Counting phrases in Python using NLTK


Problem description

I am trying to get a phrase count from a text file but so far I am only able to obtain a word count (see below). I need to extend this logic to count the number of times a two-word phrase appears in the text file.

From my understanding, phrases can be defined/grouped using logic from NLTK. I believe the collections module is what I need to obtain the desired result, but I'm not sure how to go about implementing it from reading the NLTK documentation. Any tips/help would be greatly appreciated.

import re

frequency = {}
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# match runs of 3 to 15 lowercase letters as words
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

for word in frequency:
    print(word, frequency[word])

Recommended answer

nltk.bigrams returns the pairs of adjacent words in a text, and nltk.FreqDist counts how often each pair occurs. Try this:

import nltk
# word_tokenize requires the 'punkt' tokenizer data (nltk.download('punkt'))
from nltk import bigrams, word_tokenize

with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

tokens = word_tokenize(text_string)
# FreqDist tallies how many times each two-word phrase occurs
result = nltk.FreqDist(bigrams(tokens))
print(result.most_common())

Output:

[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]
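Since the question mentions the collections module, here is a minimal stdlib-only sketch of the same idea: reuse the question's regex tokenizer and tally adjacent word pairs with collections.Counter, no NLTK required. The sample sentence is made up for illustration; in the question's setup the text would come from 'Words.txt'.

```python
import re
from collections import Counter

# hypothetical sample text standing in for the contents of 'Words.txt'
text = "the quick fox and the quick dog and the quick fox"

# same word pattern as the question's code
tokens = re.findall(r'\b[a-z]{3,15}\b', text.lower())

# pair each token with the one after it to form two-word phrases, then count
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common())
```

Counter is a dict subclass, so individual phrase counts can be read directly, e.g. `bigram_counts[('the', 'quick')]`.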

