Counting words inside a webpage

Problem description

I need to count the words inside a webpage using Python 3. Which module should I use? urllib?

This is my code:

import urllib.request

def web():
    # Fetch the page and print its raw bytes (HTML markup included)
    f = urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)

Recommended answer

The self-explanatory code below gives you a good starting point for counting words within a web page:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page and parse it (the parser is named explicitly to avoid a bs4 warning)
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content, "html.parser")

# Get the words within paragraphs, lowercased and with trailing punctuation stripped
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Get the words within divs (text in nested divs, or in a paragraph inside a div, is counted more than once)
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and get a list of (word, count) pairs from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()

For example, if you want the 10 most common words:

total.most_common(10)

In this case the output is:

In [100]: total.most_common(10)
Out[100]: 
[('the', 2097),
 ('and', 1651),
 ('of', 998),
 ('in', 625),
 ('i', 592),
 ('a', 529),
 ('to', 529),
 ('that', 426),
 ('is', 369),
 ('my', 365)]
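
Since the question asks about urllib, here is a minimal sketch that uses only the standard library (urllib.request, html.parser and collections.Counter) instead of requests and BeautifulSoup. The names TextExtractor, count_words and count_words_at are hypothetical helpers introduced for illustration, and the definition of a "word" (a run of letters or digits, optionally joined by apostrophes) is an assumption rather than anything stated in the answer above. Unlike the paragraph-plus-div approach, this sketch visits each text node once, so its counts will generally differ from the output shown.

```python
import re
import urllib.request
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html):
    """Return a Counter of lowercase words found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together
    text = " ".join(parser.chunks).lower()
    # Assumed word definition: letters/digits, optionally joined by apostrophes
    return Counter(re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text))


def count_words_at(url):
    """Fetch a page with urllib and count its words."""
    with urllib.request.urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_words(html)
```

With these helpers, something like count_words_at("https://americancivilwar.com/north/lincoln.html").most_common(10) should return the ten most frequent words on the page from the question. Joining chunks with spaces is a design choice: it keeps words in neighbouring elements apart, at the cost of splitting a word that has inline markup in the middle of it.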
