FreqDist with NLTK

Question

NLTK in Python has a class, FreqDist, which gives you the frequency of words within a text. I am trying to pass my text as an argument, but the result is of the form:

[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']

whereas in the example on the NLTK website the result was whole words, not just single characters. I am doing it this way:

from nltk import FreqDist

file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]

Do you know what I am doing wrong? Thanks!

Recommended answer

FreqDist expects an iterable of tokens. A string is itself iterable, and its iterator yields one character at a time, so passing the raw text makes FreqDist count characters instead of words.
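The difference can be seen without NLTK at all. The sketch below uses collections.Counter as a stand-in, on the assumption that you are on NLTK 3, where FreqDist is a Counter subclass; the sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample text, standing in for the contents of the file.
text = "to be or not to be"

# Iterating over a string yields single characters, so characters are counted.
char_counts = Counter(text)

# Splitting first yields words, so words are counted.
word_counts = Counter(text.split())

print(char_counts.most_common(3))
print(word_counts.most_common(2))
```

The first counter has keys such as 'o' and ' ', which matches the list of single characters shown in the question.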

Pass your text to a tokenizer first, and then pass the resulting tokens to FreqDist.
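A minimal sketch of that fix follows. The answer does not name a tokenizer, so the choice of wordpunct_tokenize here is an assumption (it is regex-based and needs no extra NLTK data downloads); the sample text, lowercasing, and the isalpha filter are also illustrative choices, not part of the original answer.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Made-up sample text; in the question this would be the string from file_y.read().
text = "The cat sat on the mat. The cat slept."

# Split the raw string into tokens, keep alphabetic tokens only, and lowercase them.
tokens = [tok.lower() for tok in wordpunct_tokenize(text) if tok.isalpha()]

# FreqDist now receives a list of words rather than a single string.
fdist = FreqDist(tokens)

print(fdist.most_common(3))
```

On NLTK 3, fdist.keys() is not ordered by frequency and cannot be sliced, so most_common(100) is the likely replacement for the vocab[:100] step in the question.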
