Word frequency in text using Python, but disregarding stop words
Problem description
This gives me the frequency of words in a text:
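    import operator
    import re
    from collections import defaultdict

    # allText holds the text to analyse
    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1
    # Sort the (word, count) pairs by count, highest first
    finalFreq = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
    self.response.out.write(finalFreq)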
This also counts useless words like "the", "an" and "a".
My question: is there a stop-word library available in Python that can remove all these common words? I want to run this on Google App Engine.
Solution
You can download lists of stop words as files in various formats, e.g. from here. All Python needs to do is read the file (these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g. lowercasing) to exclude words from the count.
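The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: it assumes Python 3, the function names load_stopwords and word_frequencies are made up for this example, and it assumes every cell in the CSV file holds one stop word, which may not match the layout of the list you download.

```python
import csv
import re
from collections import Counter


def load_stopwords(path):
    """Read every cell of a CSV file into a set of lowercased stop words.

    Assumption: each cell holds a single stop word; adjust for other layouts.
    """
    stopwords = set()
    with open(path, newline='') as f:
        for row in csv.reader(f):
            for cell in row:
                word = cell.strip().lower()
                if word:
                    stopwords.add(word)
    return stopwords


def word_frequencies(text, stopwords):
    """Count words in text, skipping any whose lowercase form is a stop word."""
    words = (w.lower() for w in re.findall(r'\w+', text))
    counts = Counter(w for w in words if w not in stopwords)
    # most_common() returns (word, count) pairs, highest count first
    return counts.most_common()
```

With a hypothetical file stopwords.csv, the call would look like word_frequencies(allText, load_stopwords('stopwords.csv')). Lowercasing both the stop words and the text means "The" and "the" are treated as the same word, and the set makes each membership check cheap.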