How to count the frequency of words existing in a text using nltk


Question

I have a Python script that reads text and applies preprocessing functions before analysis.
The problem is that I want to count the frequency of words, but the script crashes and displays the error below.

  File "F:\AIenv\textAnalysis\setup.py", line 208, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: tuple indices must be integers or slices, not str

I am trying to count the frequency of each word and then write the results to a text file.

def get_freq(tagged):
    freqs = FreqDist(tagged)
    for word, freq in freqs.items():
        print(word, freq)
    result = word,freq
    return result

def tag_and_save(tagger,text,path):
    clt = clean_text(text)
    tagged_data = tagger.tag(clt)

    freq_tagged_data = get_freq(tagged_data)
    file = open(path,"w",encoding = "UTF8")
    for word,tag in tagged_data:
        file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
    file.close()

I expect output like this:

('*****/DTNN') 3


Based on one of the answers, I changed the function get_freq() to:

def get_freq(tagged):
    freq_dist = {}
    freqs = FreqDist(tagged)
    freq_dist = [(word, freq) for word ,freq in freqs.items()]
    return freq_dist

But now it displays the error below:

  File "F:\AIenv\textAnalysis\setup.py", line 217, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: list indices must be integers or slices, not str

How can I fix this error?

Recommended answer

Maybe this will help.

import nltk
text = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."
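# Split on whitespace; punctuation stays attached to the words.
tokens = text.split()
# FreqDist maps each token to the number of times it occurs.
freqs = nltk.FreqDist(tokens)
# Convert the distribution to a list of (token, count) pairs.
blah_list = [(k, v) for k, v in freqs.items()]
print(blah_list)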

This snippet counts the word frequencies.

The code works now.
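Both errors in the question come from indexing a tuple or a list with a string. A list of (key, count) pairs cannot be looked up by key, but the FreqDist object itself can. Below is a minimal sketch of one possible fix, assuming the tagger returns (word, tag) tuples, so the keys of the distribution are those tuples and not bare words. The helper format_tagged and the sample data are hypothetical stand-ins for tag_and_save and for the output of tagger.tag(clean_text(text)), which are not shown in full in the question. The fallback import is there only so the sketch runs without NLTK installed.

```python
try:
    from nltk import FreqDist
except ImportError:
    # Fallback for environments without NLTK: FreqDist subclasses Counter,
    # so Counter supports the same lookups used in this sketch.
    from collections import Counter as FreqDist


def get_freq(tagged):
    """Return the distribution itself so it can be indexed by key."""
    return FreqDist(tagged)


def format_tagged(tagged_data):
    """Build one output line per tagged token (hypothetical helper)."""
    freqs = get_freq(tagged_data)
    lines = []
    for word, tag in tagged_data:
        # The keys are (word, tag) tuples, so look up the pair, not the word.
        count = freqs[(word, tag)]
        lines.append(f"{word}/{tag} (frequency={count})")
    return lines


# Hypothetical tagger output, used only for illustration.
sample_tagged = [("the", "DT"), ("cat", "NN"), ("the", "DT"), ("sat", "VBD")]
lines = format_tagged(sample_tagged)
print("\n".join(lines))
```

To write the result to a file as in the question, the lines could be passed to file.write inside a with open(path, "w", encoding="UTF8") block. Looping over tagged_data emits a line for every token, so repeated tokens appear more than once; looping over the distribution's items would give one line per distinct pair.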
