Return the most common words in a website, such that word count > 5

Question

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

import re
import urllib2  # Python 2 standard library
from collections import Counter

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl, 'html.parser')
dem = soup.findAll('p')  # find all paragraphs
word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))

for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    # upper-case each word and skip stopwords
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)

print word_counts

The problem is that this script returns a lot of words that are used only once. How can I update the script so that it only includes words with a count of at least 5?

Also, how can I arrange the top 5 most common words in order, as word1, word2, word3, and so on?

Recommended answer


How can I update the script so that it only includes words with a count of at least 5?

You can filter the Counter as follows: filter(lambda x: x[1] > 5, word_counts.iteritems()). This keeps words that occur more than 5 times; use >= 5 instead to also keep words that occur exactly 5 times.

filter() takes a function and an iterable, applies the function to each element of the iterable, and includes an element in the output only if the function returns True for it. iteritems() returns an iterator that yields (key, value) pairs from a dictionary.
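The same filtering idea can be sketched in Python 3, where iteritems() no longer exists and items() takes its place. This is only an illustration under stated assumptions: the sample counts and the helper name words_above are hypothetical, not part of the original script.

```python
from collections import Counter

# Hypothetical sample counts standing in for the scraped word_counts.
word_counts = Counter({"MOZART": 12, "OPERA": 6, "VIENNA": 5, "PIANO": 1})


def words_above(counts, threshold):
    """Return a dict of the words whose count is strictly greater than threshold."""
    return {word: n for word, n in counts.items() if n > threshold}


# Keep only the words that occur more than 5 times.
frequent = words_above(word_counts, 5)
print(frequent)
```

A dict comprehension is used here in place of filter() with a lambda because it returns a dictionary directly; calling words_above(word_counts, 4) would also keep words that occur exactly 5 times.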


How can I arrange the top 5 most common words in order, as word1, word2, word3, and so on?

Counter has a most_common(n) method, which returns the n most frequent elements together with their counts, ordered from most to least common. See http://docs.python.org/2/library/collections.html
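As a rough sketch of how most_common(n) could be used to list the top 5 words in order (Python 3 syntax; the sample counts below are hypothetical and stand in for the scraped word_counts):

```python
from collections import Counter

# Hypothetical sample counts standing in for the scraped word_counts.
word_counts = Counter({
    "MOZART": 12, "SALZBURG": 9, "MUSIC": 8, "FATHER": 7,
    "OPERA": 6, "VIENNA": 5, "PIANO": 1,
})

# most_common(5) returns a list of (word, count) tuples, highest count first.
top_five = word_counts.most_common(5)

# Drop the counts to get just the words: word1, word2, word3, ...
top_words = [word for word, _ in top_five]
print(", ".join(top_words))
```

The list keeps the ranking order, so top_words[0] is the most frequent word and top_five[0][1] is its count.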
