Counting phrases in Python using NLTK
Problem Description
I am trying to get a phrase count from a text file, but so far I am only able to obtain a word count (see below). I need to extend this logic to count the number of times a two-word phrase appears in the text file.
From my understanding, phrases can be defined/grouped using logic from NLTK. I believe the collections module is what I need to obtain the desired result, but I'm not sure how to implement it from reading the NLTK documentation. Any tips/help would be greatly appreciated.
import re

frequency = {}
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()

# find every run of 3-15 lowercase letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

for word in frequency:
    print(word, frequency[word])
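Before reaching for NLTK, the word-count loop above can be extended to two-word phrases with the standard library alone: pair each word with its successor via zip and count the pairs with collections.Counter (the module the question mentions). A minimal sketch, using a hypothetical inline sample string in place of Words.txt:

```python
import re
from collections import Counter

# inline sample standing in for the contents of Words.txt
text_string = "the quick brown fox jumps over the quick brown dog"

# same word pattern as the word-count code above
words = re.findall(r'\b[a-z]{3,15}\b', text_string)

# pair each word with the one after it, then count the pairs
frequency = Counter(zip(words, words[1:]))

for phrase, count in frequency.most_common():
    print(phrase, count)
```

most_common() lists the pairs from most to least frequent, so the top entries are the phrases that repeat.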
Recommended Answer
nltk.bigrams returns the pairs of consecutive words in a text; counting those pairs gives the frequency of each two-word phrase in a specific text. Try this:
import nltk
from nltk import bigrams, word_tokenize
from collections import Counter

# word_tokenize requires the NLTK tokenizer models: nltk.download('punkt')
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()

tokens = word_tokenize(text_string)
result = Counter(bigrams(tokens))  # maps each two-word pair to its count
print(result.most_common())
Output:
[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]
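The same counting pattern generalizes beyond bigrams to phrases of any length n. NLTK offers this windowing as nltk.ngrams, but as a sketch it can also be written with the standard library alone (the helper name ngram_counts is just illustrative):

```python
from collections import Counter

def ngram_counts(words, n):
    """Count consecutive n-word phrases in a list of word tokens."""
    # zip over n staggered views of the list to build sliding windows
    return Counter(zip(*(words[i:] for i in range(n))))

words = "to be or not to be".split()
print(ngram_counts(words, 2).most_common(1))  # [(('to', 'be'), 2)]
```

Passing n=2 reproduces the bigram counts; n=3 counts three-word phrases, and so on.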