Word frequency in text using Python, but disregarding stop words

Problem description

This gives me a frequency of words in a text:

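    import operator
    import re
    from collections import defaultdict

    fullWords = re.findall(r'\w+', allText)

    # Count how many times each word occurs
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1

    # Sort (word, count) pairs by count, most frequent first
    finalFreq = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

    self.response.out.write(finalFreq)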

This also gives me useless words like "the", "an", and "a".

My question is: is there a stop words library available in Python that can remove all these common words? I want to run this on Google App Engine.

Solution

You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is to read the file (and these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g., lowercasing) to exclude words from the count.
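
As a rough sketch of that approach (not a definitive implementation), something like the following could work. It assumes Python 3, assumes the stop word file is a CSV whose cells each hold one word, and swaps the question's defaultdict for collections.Counter. The names load_stopwords, word_frequencies, and sample_csv are hypothetical, and the in-memory sample only stands in for a real downloaded file.

```python
import csv
import io
import re
from collections import Counter


def load_stopwords(csv_file):
    """Read every cell of a CSV file object into a set of lowercased stop words."""
    stopwords = set()
    for row in csv.reader(csv_file):
        for cell in row:
            word = cell.strip().lower()
            if word:
                stopwords.add(word)
    return stopwords


def word_frequencies(text, stopwords):
    """Count the words in text, skipping stop words; most frequent first."""
    # Lowercase before matching so "The" and "the" are treated alike.
    words = re.findall(r'\w+', text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common()


# Hypothetical stop word data standing in for a downloaded CSV file.
sample_csv = io.StringIO("the,an,a\nof,and,on\n")
stopwords = load_stopwords(sample_csv)

allText = "The cat and the hat. A cat sat on the mat."
finalFreq = word_frequencies(allText, stopwords)
print(finalFreq)
```

With a real file, you would pass an opened file object (for example, one opened with newline='' as the csv module recommends) to load_stopwords instead of the StringIO sample. Using a set keeps each membership check cheap even for a long stop word list.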
