Find Most Common Words from a Website in Python 3

Question

I need to find and copy the words that appear more than 5 times on a given website using Python 3, and I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far:

   from urllib.request import urlopen
   # urlopen was imported by name, so call it directly (urllib.urlopen is Python 2)
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

Does anyone have any advice on what to do? I have NLTK installed and I've looked into Beautiful Soup, but for the life of me I have no idea how to install it correctly (I'm very Python-green)! As I am learning, any explanation would also be very much appreciated. Thank you :)

Recommended answer

This is not perfect, but it should give you an idea of how to get started using requests, BeautifulSoup and collections.Counter:

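# Third-party packages: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Download the page.
r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

# Name the parser explicitly; otherwise bs4 warns and picks whichever one is installed.
soup = BeautifulSoup(r.content, "html.parser")

# Text of every <p> element, one string per paragraph.
text = (p.get_text() for p in soup.find_all('p'))

# Lowercase each whitespace-separated token and strip trailing punctuation.
c = Counter(x.rstrip(punctuation).lower() for y in text for x in y.split())
print(c.most_common())  # prints all words, starting with the most common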

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print([x for x in c if c[x] > 5])  # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
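
The list above still contains tokens such as 'salzburg,' and 'needed]' next to 'salzburg', so the same word can end up counted under several spellings. One possible way to tighten this up is to pull words out with a regular expression instead of splitting on whitespace. The sketch below is only an illustration: the helper name `words_over`, the pattern `WORD_RE` and the sample string are made up for this example, and the pattern assumes a "word" is a run of ASCII letters or digits with optional inner apostrophes. It would also turn abbreviations such as 'k.' into 'k'.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters/digits, optionally joined by
# straight apostrophes (so "mozart's" stays one token).
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def words_over(text, threshold=5):
    """Return {word: count} for words that occur more than `threshold` times."""
    counts = Counter(WORD_RE.findall(text.lower()))
    # most_common() yields (word, count) pairs sorted by descending count.
    return {word: n for word, n in counts.most_common() if n > threshold}


# Hypothetical sample text; in the answer above you would pass the joined
# paragraph text of the downloaded page instead.
sample = "Salzburg, Salzburg. salzburg! Vienna; vienna Mozart's mozart's"
result = words_over(sample, threshold=1)
print(result)
```

With the page from the answer, something like `words_over(" ".join(p.get_text() for p in soup.find_all('p')))` should give the words appearing more than 5 times together with their counts.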
