How to count the frequency of words existing in a text using nltk
Question
I have a python script that reads the text and applies preprocess functions in order to do the analysis.
The problem is that I want to count the frequency of words, but the system crashes and displays the error below.
File "F:\AIenv\textAnalysis\setup.py", line 208, in tag_and_save
file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: tuple indices must be integers or slices, not str
I am trying to count the frequency and then write the result to a text file.
def get_freq(tagged):
    freqs = FreqDist(tagged)
    for word, freq in freqs.items():
        print(word, freq)
        result = word, freq
    return result
def tag_and_save(tagger, text, path):
    clt = clean_text(text)
    tagged_data = tagger.tag(clt)
    freq_tagged_data = get_freq(tagged_data)
    file = open(path, "w", encoding="UTF8")
    for word, tag in tagged_data:
        file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
    file.close()
I expect output like this:
('*****/DTNN') 3
Based on the answer of
I changed the function get_freq() to:
def get_freq(tagged):
    freq_dist = {}
    freqs = FreqDist(tagged)
    freq_dist = [(word, freq) for word, freq in freqs.items()]
    return freq_dist
But now it displays the below error:

File "F:\AIenv\textAnalysis\setup.py", line 217, in tag_and_save
file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: list indices must be integers or slices, not str
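Both tracebacks come from the same pattern: indexing a sequence with a string. A minimal reproduction (the values here are hypothetical, standing in for what get_freq() returns):

```python
# get_freq() returns the last (word, freq) pair from its loop,
# not a word -> count mapping, so indexing it with a string
# raises the same TypeError shown in the traceback.
result = ("went", 3)   # hypothetical last pair produced by the loop
try:
    result["went"]     # tuple indices must be integers or slices
except TypeError as err:
    print(err)
```

The same applies to the revised version: a list of (word, freq) pairs also cannot be indexed by a string.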
How can I fix this error, and what should I do?
Answer
Maybe this will help.
import nltk
text = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."
tokens = text.split()
freqs = nltk.FreqDist(tokens)
blah_list = [(k, v) for k, v in freqs.items()]
print(blah_list)
This snippet counts the word frequency.
The code should work now.
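Note that for the tag_and_save() loop in the question to index by key, get_freq() would need to return something dict-like rather than a list. A minimal sketch using collections.Counter (nltk's FreqDist is a Counter subclass; the tagged data below is hypothetical, standing in for tagger.tag(clean_text(text))):

```python
from collections import Counter  # FreqDist(tagged) behaves the same way here

def get_freq(tagged):
    """Return a dict-like mapping of (word, tag) pairs to their counts."""
    return Counter(tagged)  # with nltk: return FreqDist(tagged)

# hypothetical tagged data for illustration
tagged_data = [("the", "DT"), ("valley", "NN"), ("the", "DT")]
freq_tagged_data = get_freq(tagged_data)

lines = []
for word, tag in tagged_data:
    # index by the (word, tag) pair -- the keys are tuples, not bare words
    lines.append(word + "/" + tag + " (frequency=" + str(freq_tagged_data[(word, tag)]) + ")")
print("\n".join(lines))
```

Writing each line to a file with file.write(line + "\n") then matches the format the question asks for.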