Python web scraping: counting the occurrences of a list of words on each page
Question
I am trying to find a set of specific words ("shall", "may", "must", etc.) on each page and add up their occurrences. This is the code I used:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re


def levelfour(main_url):
    pattern = re.compile(r"\bmay not\b", re.IGNORECASE)
    pattern1 = re.compile(r"\bshall\b", re.IGNORECASE)
    pattern2 = re.compile(r"\bmust\b", re.IGNORECASE)
    pattern3 = re.compile(r"\bprohibited\b", re.IGNORECASE)
    pattern4 = re.compile(r"\brequired\b", re.IGNORECASE)

    r = requests.get(main_url)
    soup = BeautifulSoup((r.content), "html.parser")
    results = soup.find('article', {'id': 'maincontent'})
    results = results.text.encode("utf-8", "ignore")

    total = 0
    total1 = 0
    total2 = 0
    total3 = 0
    total4 = 0

    m = re.findall(pattern, r.content)
    m1 = re.findall(pattern1, r.content)
    m2 = re.findall(pattern2, r.content)
    m3 = re.findall(pattern3, r.content)
    m4 = re.findall(pattern4, r.content)
    total += len(m)
    total1 += len(m1)
    total2 += len(m2)
    total3 += len(m3)
    total4 += len(m4)
    print total, total1, total2, total3, total4

########################################Sections##########################
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                levelfour(href)

########################################Chapters##########################
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                levelthree(chapterlinks)
                print (chapterlinks)

######################################Titles###############################
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title-54' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print (titlelinks)

###########################################################################
base_url = "http://law.justia.com/codes/idaho/2015/"
levelone(base_url)

When I print total, total1, total2, total3, and total4, all I get is zeros ([0, 0, 0, 0, 0]). My question: how can I correctly find and add up the occurrences of this set of words?
Recommended answer
Using m = re.findall(pattern, r.content), that is, running each pattern against the raw response body, solved the problem.
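The same counting step can be sketched without the five near-identical patterns and counters. This is a minimal sketch in Python 3, not the asker's code: count_terms, TERMS, and the sample strings are hypothetical names and data introduced for illustration. It assumes the page text is already available as a str. In Python 3, r.content is bytes, so a str pattern would have to be run against r.text (or a decoded copy) instead.

```python
import re
from collections import Counter

# Terms taken from the patterns in the question.
TERMS = ["may not", "shall", "must", "prohibited", "required"]


def count_terms(text, terms=TERMS):
    """Count whole-word, case-insensitive matches of each term in text (a str)."""
    counts = {}
    for term in terms:
        # \b keeps "shall" from matching inside a longer word such as "marshall".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


# Stand-in strings for the text of two scraped pages (made up for illustration).
pages = [
    "Permits shall be required. A permit may not be sold and MUST be shown.",
    "Fires are prohibited. Visitors shall register.",
]

# Add the per-page counts into one running total across all pages.
totals = Counter()
for page_text in pages:
    totals.update(count_terms(page_text))

print(dict(totals))
```

In the crawler from the question, levelfour could return count_terms(...) for its page and the caller could fold that result into a shared Counter, rather than printing counters that are reset to zero on every call.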