Word Frequency in text using Python but disregard stop words

Views: 312 | Posted: 2018/5/4 11:23:42 | Tags: python google-app-engine frequency-analysis word-frequency


Question

This gives me the frequency of each word in a text:

    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1
    finalFreq = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    self.response.out.write(finalFreq)

This also counts useless words like "the", "an", and "a".

My question is: is there a stop-words library available in Python that can remove all these common words? I want to run this on Google App Engine.

Solution

You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is to read the file (and these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g., lowercasing) to exclude words from the count.
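
The steps described in the answer (read the file, build a set, normalize, filter by membership) might look roughly like the sketch below. It is an illustration under several assumptions, not the answerer's own code: it targets Python 3 rather than the Python 2 style of the question, the helper names `load_stopwords` and `word_frequencies` are hypothetical, and a small in-memory CSV stands in for a downloaded stop-word file whose exact layout is not given in the answer.

```python
import csv
import io
import re
from collections import Counter


def load_stopwords(csv_file):
    """Collect every non-empty cell of a CSV file into a lowercase set."""
    stopwords = set()
    for row in csv.reader(csv_file):
        for cell in row:
            word = cell.strip().lower()
            if word:
                stopwords.add(word)
    return stopwords


def word_frequencies(text, stopwords):
    """Count lowercased words, skipping stop words; most frequent first."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common()


# Hypothetical stand-in for a downloaded stop-word file (assumed layout:
# comma-separated words, possibly spread over several rows).
sample_csv = io.StringIO("the,an,a\nof,and,is\n")
stopwords = load_stopwords(sample_csv)

all_text = "The cat and the hat: a cat is a cat."
final_freq = word_frequencies(all_text, stopwords)
print(final_freq)
```

With a real file, the same helper could be called as `load_stopwords(open(path, newline=''))`, where `path` is wherever the downloaded list was saved. Lowercasing both the stop words and the text keeps "The" and "the" from being treated as different words, and a set makes each membership check cheap.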
