How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?

Question

I am interested in grabbing text from web pages of medical documents for a Natural Language Processing project. The document text I am scraping was not designed with any semantic markup; it is just a big blob of text with bold headings. After getting some help, and starting with the first page I am interested in, I implemented the following code to grab the document text from the web page:

import requests
import re
from bs4 import BeautifulSoup, Tag, NavigableString, Comment

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D'
res = requests.get(url)
res.raise_for_status()
html = res.text
soup = BeautifulSoup(html, 'html.parser')

title_el = soup.find('h1')
page_title = title_el.text.strip()
first_hr = title_el.find_next_sibling('hr')

description_title = title_el.find_next_sibling('b', 
text=re.compile('description', flags=re.I))
description_text_parts = []
for s in description_title.next_siblings:
    if s is first_hr:
        break
    if isinstance(s, Tag):
        description_text_parts.append(s.text.strip())
    elif isinstance(s, NavigableString):
        description_text_parts.append(str(s).strip())
description_text = '\n'.join(p for p in description_text_parts if p.strip())

# titles are all bold and uppercase
titles = [b for b in first_hr.find_next_siblings('b') if b.text.strip().isupper()]

docs = []
for t in titles:
    text_parts = []
    for s in t.next_siblings:
        # go until next title
        if s in titles:
            break
        if isinstance(s, Comment):
            continue
        if isinstance(s, Tag):
            if s.name == 'div':
                break
            text_parts.append(s.text.strip())
        elif isinstance(s, NavigableString):
            text_parts.append(str(s).strip())
    text = '\n'.join(p for p in text_parts if p.strip())
    docs.append({
        'title': t.text.strip(),
        'text': text
    })

This adds the document text to a list named docs, with each bold heading becoming one dictionary that has a title key and a text key. At this point, docs holds only the sections of the web page scraped in the example above.
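To make that structure concrete, here is a minimal sketch of what docs holds after one page and how a section could be looked up by its heading. The two records are taken from the sample page's sections; section_text is a hypothetical helper name introduced only for illustration.

```python
# Shape of `docs` after scraping one page: a list of {'title', 'text'} dicts.
docs = [
    {'title': 'PREOPERATIVE DIAGNOSIS:', 'text': 'Abdominal wall abscess.'},
    {'title': 'ANESTHESIA:', 'text': 'LMA.'},
]


def section_text(docs: list, title: str) -> str:
    """Return the text of the first section whose title matches, or ''.

    Matching ignores case, surrounding whitespace, and a trailing colon.
    (Hypothetical helper, not part of the original code.)
    """
    wanted = title.strip().rstrip(':').upper()
    for doc in docs:
        if doc['title'].strip().rstrip(':').upper() == wanted:
            return doc['text']
    return ''


print(section_text(docs, 'anesthesia'))  # LMA.
```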

I would like to create a loop that adds all medical document records in the Gastroenterology section, starting from the page at https://www.mtsamples.com/site/pages/browse.asp?type=24-Gastroenterology&page=1. There are 23 separate pages, each listing a number of medical documents in alphabetical order, for a total of 230 medical documents. What would be the best way to write this loop? Again, my goal is to append each medical document to the docs list, as shown for the first example in my code above. Any help would be much appreciated!

Recommended answer

Simply find all the pagination URLs, then walk all those pages, find the document URLs, and extract the documents. Here is a full solution.
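If the number of listing pages is already known (the question says 23), the page URLs could also be generated directly rather than discovered from the pagination links. This is a minimal sketch, assuming the pages are numbered consecutively from 1 and that the page query parameter behaves as in the URL from the question. listing_page_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode

BROWSE_URL = 'https://www.mtsamples.com/site/pages/browse.asp'


def listing_page_urls(section: str, page_count: int) -> list:
    """Build one browse URL per listing page, numbered 1..page_count.

    Assumption: the site accepts every page number in that range.
    """
    return [
        f"{BROWSE_URL}?{urlencode({'type': section, 'page': n})}"
        for n in range(1, page_count + 1)
    ]


urls = listing_page_urls('24-Gastroenterology', 23)
print(len(urls))  # 23
print(urls[0])
```

Discovering the links from the page itself avoids hard-coding the page count, at the cost of one extra request.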

The code below walks the listing pages concurrently and extracts the documents from all pages in batches:

import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup


def make_soup_parallel(urls: list) -> list:
    workers = min(10, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as e:
        return list(e.map(make_soup, urls))


def find_pagination_urls(soup: BeautifulSoup) -> list:
    urls = set()
    for a in soup.select('.Contrast a'):
        if not a.text.isnumeric():
            continue
        url = urljoin(BASE_URL, a['href'])
        urls.add(url)
    return sorted(list(urls), key=lambda u: int(u.split('page=')[1]))


def find_document_urls(soup: BeautifulSoup) -> list:
    urls = []
    for a in soup.select('#Browse a'):
        url = urljoin(BASE_URL, a['href'])
        urls.append(url)
    return urls


def find_all_doc_urls() -> list:
    index_url = 'https://www.mtsamples.com/site/pages/browse.asp?type=24-Gastroenterology&page=1'
    index_soup = make_soup(index_url)

    next_pages = find_pagination_urls(index_soup)
    doc_urls = []
    for soup in make_soup_parallel(next_pages):
        # use each listing page's own soup, not the first page's
        doc_urls.extend(find_document_urls(soup))
    return doc_urls


def extract_docs(soup: BeautifulSoup) -> list:
    title_el = soup.find('h1')
    first_hr = title_el.find_next_sibling('hr')

    # titles are all bold and uppercase
    titles = [b for b in first_hr.find_next_siblings('b') if b.text.strip().isupper()]

    docs = []
    for t in titles:
        text_parts = []
        for s in t.next_siblings:
            # go until next title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, Tag):
                if s.name == 'div':
                    break
                text_parts.append(s.text.strip())
            elif isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        docs.append({
            'title': t.text.strip(),
            'text': text
        })
    return docs


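def batch(it, n: int):
    # n references to the same iterator, so zip_longest pulls n items per tuple;
    # the last tuple is padded with None when the input runs out
    it = [iter(it)] * n
    return itertools.zip_longest(*it, fillvalue=None)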


docs = []
doc_urls = find_all_doc_urls()

for b in batch(doc_urls, 5):
    batch_urls = list(filter(bool, b))
    for soup in make_soup_parallel(batch_urls):
        docs.extend(extract_docs(soup))
pprint(docs)

Output:

[{'text': 'Abdominal wall abscess.', 'title': 'PREOPERATIVE DIAGNOSIS:'},
 {'text': 'Abdominal wall abscess.', 'title': 'POSTOPERATIVE DIAGNOSIS:'},
 {'text': 'Incision and drainage (I&D) of abdominal abscess, excisional '
          'debridement of nonviable and viable skin, subcutaneous tissue and '
          'muscle, then removal of foreign body.',
  'title': 'PROCEDURE:'},
 {'text': 'LMA.', 'title': 'ANESTHESIA:'},
...
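The batch helper in the script pads the last group with None, which is why the main loop filters each batch with filter(bool, b). A variant that yields a shorter final batch avoids the padding. The sketch below is an alternative under stated assumptions, not the answer's code: chunked, fake_fetch, and fetch_all are hypothetical names, and fake_fetch stands in for make_soup so the example runs without network access.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n: int):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk


def fake_fetch(url: str) -> str:
    """Stand-in for make_soup (hypothetical) so the sketch runs offline."""
    return f'<html>{url}</html>'


def fetch_all(urls: list, batch_size: int = 5, max_workers: int = 10) -> list:
    """Fetch URLs batch by batch; results keep the order of `urls`."""
    pages = []
    for chunk in chunked(urls, batch_size):
        workers = min(max_workers, len(chunk))
        with ThreadPoolExecutor(max_workers=workers) as executor:
            # executor.map yields results in input order, not completion order
            pages.extend(executor.map(fake_fetch, chunk))
    return pages


urls = [f'doc-{i}' for i in range(12)]
print([len(chunk) for chunk in chunked(urls, 5)])  # [5, 5, 2]
print(fetch_all(urls)[:2])
```

Because executor.map returns results in input order, the scraped documents stay aligned with the URL list even though the requests within a batch run concurrently.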
