Python web scraping: counting the occurrences of a list of words on each page


Problem description

I am trying to find a set of specific words ("shall", "may", "must", etc.) on each page and add up their occurrences. The code I used:

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re


def levelfour(main_url):

    pattern = re.compile(r"\bmay not\b", re.IGNORECASE)
    pattern1 = re.compile(r"\bshall\b", re.IGNORECASE)
    pattern2 = re.compile(r"\bmust\b", re.IGNORECASE)
    pattern3 = re.compile(r"\bprohibited\b", re.IGNORECASE)
    pattern4 = re.compile(r"\brequired\b", re.IGNORECASE)

    r = requests.get(main_url)
    soup = BeautifulSoup((r.content), "html.parser")
    results = soup.find('article', {'id': 'maincontent'})
    results = results.text.encode("utf-8", "ignore")

    total = 0
    total1 = 0
    total2 = 0
    total3 = 0
    total4 = 0

    m = re.findall(pattern, r.content)
    m1 = re.findall(pattern1, r.content)
    m2 = re.findall(pattern2, r.content)
    m3 = re.findall(pattern3, r.content)
    m4 = re.findall(pattern4, r.content)
    total += len(m)
    total1 += len(m1)
    total2 += len(m2)
    total3 += len(m3)
    total4 += len(m4)
    print(total, total1, total2, total3, total4)

########################################Sections##########################
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                levelfour(href)

########################################Chapters##########################
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                levelthree(chapterlinks)
                print(chapterlinks)

######################################Titles###############################
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title-54' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print(titlelinks)

###########################################################################
base_url = "http://law.justia.com/codes/idaho/2015/"
levelone(base_url)

When I print out total, total1, total2, total3, total4, it gives all zeros: [0, 0, 0, 0, 0]. My question: how can I properly find and add up the occurrences of this set of words?

Recommended answer

Using `m = re.findall(pattern, r.content)` solved the problem.
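For illustration, here is a minimal, self-contained sketch of the counting step on its own, decoupled from the network calls. The sample HTML string is a stand-in for the fetched page text, not the actual Justia content. Note that in Python 3, `r.content` is `bytes`, so `re.findall` with a `str` pattern would raise a `TypeError`; using the decoded `r.text` avoids that.

```python
import re

# Stand-in for the decoded page text (r.text in the original code)
html = "<p>The licensee shall comply. A permit may not be transferred. Filing is required.</p>"

# Same word patterns as in the question, keyed by the word they match
patterns = {
    "may not": re.compile(r"\bmay not\b", re.IGNORECASE),
    "shall": re.compile(r"\bshall\b", re.IGNORECASE),
    "must": re.compile(r"\bmust\b", re.IGNORECASE),
    "prohibited": re.compile(r"\bprohibited\b", re.IGNORECASE),
    "required": re.compile(r"\brequired\b", re.IGNORECASE),
}

# Count occurrences of each word on the page
totals = {word: len(p.findall(html)) for word, p in patterns.items()}
print(totals)
```

A dict of counts is also easier to accumulate across pages than five separate `total` variables: each page's counts can be added into a running `collections.Counter`.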

