Automatically crawling web sites

Problem description

I got help from here to crawl law.go.kr with the code below.
I'm trying to crawl other websites such as http://lawbot.org and https://casenote.kr.
The problem is that I have very little understanding of HTML.
I understand all of the code below and how its URL for law.go.kr is built, but the markup is different on other websites.
I want to know how to use the code below to crawl other web pages.
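
The recipe is the same for every site: fetch a page, open it in your browser's developer tools (right-click → Inspect) to find a CSS selector for the elements you want, then pass that selector to BeautifulSoup. A minimal sketch of that workflow, with a placeholder selector you would replace after inspecting the real page:

import requests
from bs4 import BeautifulSoup

url = "http://lawbot.org/?q=유죄"  # any page you want to crawl
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# ".some-result a" is a placeholder selector; replace it with whatever you
# find by inspecting the target page in your browser's developer tools
for link in soup.select(".some-result a"):
    print(link.get("href"), link.text.strip())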

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':

    # Using requests, fetch the first page of results: pg=1 is the page
    # number, outmax is the number of items per page
    response = requests.post(
        "http://law.go.kr/precScListR.do?q=*&section=evtNm&outmax=79329&pg=1&fsort=21,10,30&precSeq=0&dtlYn=N")

    # Parse html using BeautifulSoup
    page = BeautifulSoup(response.text, "html.parser")

    # Collect post numbers in items (this loop body runs only once, so just
    # the first page fetched above is processed; see the pagination sketch below)
    items = []
    for i in range(1, 2):
        # Get all links
        links = page.select("#viewHeightDiv .s_tit a")
        # Loop all links and collect post numbers
        for link in links:
            # Parse post number from "onclick" attribute
            items.append(''.join([n for n in link.attrs["onclick"] if n.isdigit()]))

    # Open all posts and collect in posts dictionary with keys: number, url and text
    posts = []
    for item in items:
        url = "http://law.go.kr/precInfoR.do?precSeq=%s&vSct=*" % item
        response = requests.get(url)
        parsed = BeautifulSoup(response.text, "html.parser")
        # Save the full decision text ('id': 'contentBody'); to save it
        # without the title, use 'class': 'pgroup' instead
        text = parsed.find('div', attrs={'id': 'contentBody'}).text
        title = parsed.select_one("h2").text
        posts.append({'number': item, 'url': url, 'text': text, 'title': title})

        with open("D://\LAWGO_DATA/" + item + '.txt', 'w', encoding='utf8') as f:
            f.write(text)
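
Note that the loop above never requests another page, so only pg=1 is ever processed. A sketch of how the same endpoint could be paged properly, assuming it accepts any pg value together with a smaller outmax:

import requests
from bs4 import BeautifulSoup

# Assumed: the endpoint pages its results the same way for every pg value
base = ("http://law.go.kr/precScListR.do?q=*&section=evtNm&outmax=50"
        "&pg=%d&fsort=21,10,30&precSeq=0&dtlYn=N")

items = []
for pg in range(1, 4):  # first three pages, for example
    page = BeautifulSoup(requests.post(base % pg).text, "html.parser")
    links = page.select("#viewHeightDiv .s_tit a")
    if not links:  # stop when a page comes back empty
        break
    for link in links:
        items.append(''.join(n for n in link.attrs["onclick"] if n.isdigit()))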

Recommended answer

Another example, for lawbot.org:

import requests
from bs4 import BeautifulSoup

base_url = 'http://lawbot.org'
search_url = base_url + '/?q=유죄'  # search query: 유죄 ("guilty")

response = requests.get(search_url)

page = BeautifulSoup(response.text, "html.parser")
# Read the number of the last page from the pagination bar
lastPageNumber = int(page.select_one("li.page-item:not(.next):nth-last-child(2)").text)

casesList = []

for i in range(1, lastPageNumber + 1):
    # Page 1 was already fetched above; fetch the remaining pages by number
    if i > 1:
        response = requests.get(search_url + "&page=" + str(i))
        page = BeautifulSoup(response.text, "html.parser")

    cases = page.select("div.panre_center > ul.media-list li.panre_lists")
    for case in cases:
        title = case.findChild("h6").text
        caseDocNumber = case.findChild(attrs={"class": "caseDocNumber"}).text
        caseCourt = case.findChild(attrs={"class": "caseCourt"}).text
        case_url = base_url + case.findChild("a")['href']

        casesList.append({"title": title, "caseDocNumber": caseDocNumber, "caseCourt": caseCourt, "case_url": case_url})
        # print("title:{}, caseDocNumber:{}, caseCourt:{}, caseUrl:{}".format(title, caseDocNumber, caseCourt, case_url))

# Fetch each case page and print the decision body text
for case in casesList:
    response = requests.get(case["case_url"])
    page = BeautifulSoup(response.text, "html.parser")
    body = page.find(attrs={"class": "panre_body"}).text
    print(body)
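
The same pattern should transfer to casenote.kr: a search URL, a selector for the result links, and a selector for the decision body. The search path and both selectors below are hypothetical placeholders; inspect the real pages and substitute what you find there:

import requests
from bs4 import BeautifulSoup

base_url = "https://casenote.kr"
# "/search?q=..." is a hypothetical search path; verify it on the real site
response = requests.get(base_url + "/search?q=유죄")
page = BeautifulSoup(response.text, "html.parser")

for link in page.select("a.case-link"):  # hypothetical selector
    case_page = BeautifulSoup(requests.get(base_url + link["href"]).text,
                              "html.parser")
    body = case_page.find(attrs={"class": "case-body"})  # hypothetical class
    if body is not None:
        print(body.text)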
