Python web scraping: counting the occurrences of a list of words on each page


Problem description

I am trying to find a set of specific words ("shall", "may", "must", etc.) on each page and add up their occurrences. The code I used:

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re


def levelfour(main_url):

    pattern = re.compile(r"\bmay not\b", re.IGNORECASE)
    pattern1 = re.compile(r"\bshall\b", re.IGNORECASE)
    pattern2 = re.compile(r"\bmust\b", re.IGNORECASE)
    pattern3 = re.compile(r"\bprohibited\b", re.IGNORECASE)
    pattern4 = re.compile(r"\brequired\b", re.IGNORECASE)

    r = requests.get(main_url)
    soup = BeautifulSoup((r.content), "html.parser")
    results = soup.find('article', {'id': 'maincontent'})
    results = results.text.encode("utf-8", "ignore")

    total = 0
    total1 = 0
    total2 = 0
    total3 = 0
    total4 = 0

    m = re.findall(pattern, r.content)
    m1 = re.findall(pattern1, r.content)
    m2 = re.findall(pattern2, r.content)
    m3 = re.findall(pattern3, r.content)
    m4 = re.findall(pattern4, r.content)
    total += len(m)
    total1 += len(m1)
    total2 += len(m2)
    total3 += len(m3)
    total4 += len(m4)
    print(total, total1, total2, total3, total4)

########################################Sections##########################
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                levelfour(href)

########################################Chapters##########################
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                levelthree(chapterlinks)
                print(chapterlinks)

######################################Titles###############################
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title-54' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print(titlelinks)

###########################################################################
base_url = "http://law.justia.com/codes/idaho/2015/"
levelone(base_url)

When I print out total, total1, total2, total3, total4, it gives all zeros: [0, 0, 0, 0, 0]. My question: how can I properly find and add up the occurrences of this set of words?

Recommended answer

Using `m = re.findall(pattern, r.content)` solved the problem.
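For illustration, here is a minimal, self-contained sketch of the counting step on its own, decoupled from the network calls. The sample HTML string is a stand-in for the fetched page text, not the actual Justia content. Note that in Python 3, `r.content` is `bytes`, so `re.findall` with a `str` pattern would raise a `TypeError`; using the decoded `r.text` avoids that.

```python
import re

# Stand-in for the decoded page text (r.text in the original code)
html = "<p>The licensee shall comply. A permit may not be transferred. Filing is required.</p>"

# Same word patterns as in the question, keyed by the word they match
patterns = {
    "may not": re.compile(r"\bmay not\b", re.IGNORECASE),
    "shall": re.compile(r"\bshall\b", re.IGNORECASE),
    "must": re.compile(r"\bmust\b", re.IGNORECASE),
    "prohibited": re.compile(r"\bprohibited\b", re.IGNORECASE),
    "required": re.compile(r"\brequired\b", re.IGNORECASE),
}

# Count occurrences of each word on the page
totals = {word: len(p.findall(html)) for word, p in patterns.items()}
print(totals)
```

A dict of counts is also easier to accumulate across pages than five separate `total` variables: each page's counts can be added into a running `collections.Counter`.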

