How to accelerate web scraping using the combination of requests and BeautifulSoup in Python?


Question

The objective is to scrape multiple pages using BeautifulSoup, where each page is fetched with requests.get.

The steps are:

First, fetch the page using requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then, scrape the HTML content using the definition below:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)
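
Putting those two steps together for a single record page looks like this. This is just a minimal sketch for illustration: the record path is the example one from the question, and the field extraction is kept exactly as defined above.

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

# Fetch one record page and parse it only if the request succeeded
page = requests.get('https://oatd.org/oatd/record?record=handle:11012%2F16478&q=eeg')
if page.status_code == 200:
    record = get_each_page(Soup(page.text, 'html.parser'))
    print(record)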

Say we have a hundred unique URLs to scrape, ['record?record=handle:11012%2F16478&q=eeg'] * 100; the whole process can be completed via the code below:

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle:11012%2F16478&q=eeg'] * 100  # In practice, there will be 100 different unique sub-hrefs. For illustration purposes, we purposely duplicate the url
all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

However, each URL is requested and scraped one at a time, which makes the process time-consuming.

I wonder if there is another way, that I am not aware of, to improve the performance of the code above?

Answer

realpython.com has a nice article about speeding up Python scripts with concurrency:

https://realpython.com/python-concurrency/

Using their threading example, you can set the number of workers to execute multiple threads, which increases the number of requests you can make at once.

from bs4 import BeautifulSoup as Soup
import concurrent.futures
import requests
import threading
import time

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

def get_session():
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(url_to_pass):
    session = get_session()
    page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
    print(f"{page.status_code}: {page.reason}")
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)

if __name__ == "__main__":
    list_of_url = ['record?record=handle:11012%2F16478&q=eeg'] * 100  # In practice, there will be 100 different unique sub-hrefs. For illustration purposes, we purposely duplicate the url
    all_website_scrape = []
    thread_local = threading.local()
    start_time = time.time()
    download_all_sites(list_of_url)
    duration = time.time() - start_time
    print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")

