Can't modify my script to delimit the number of requests while scraping


Problem description

I've written a script in Python using Thread to handle multiple requests at the same time and make the scraping process faster. The script is doing its job accordingly.

In short, what the scraper does: it parses all the links from the landing page leading to its main page (where the information is stored) and scrapes the happy hours and featured special from there. The scraper keeps going until all 29 pages are crawled.

As there may be numerous links to play with, I would like to limit the number of requests. However, as I don't have much idea about this, I can't find any ideal way to modify my existing script to serve the purpose.

Any help will be highly appreciated.

This is my attempt so far:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import threading

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def get_info(link):
    # Walk through all 29 listing pages and collect the links to the detail pages.
    for mlink in [link.format(page) for page in range(1,30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text,"lxml")
        itemlinks = [urljoin(link,container.select_one("h2.name a").get("href")) for container in soup.select(".profile")]
        # One thread per detail link -- there is currently no cap on how many run at once.
        threads = []
        for ilink in itemlinks:
            thread = threading.Thread(target=fetch_info,args=(ilink,))
            thread.start()
            threads.append(thread)

        for thread in threads:
            thread.join()

def fetch_info(nlink):
    # Scrape the happy hours and featured specials from a single detail page.
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text,"lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception: hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception: fspecial = ""
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    get_info(url)

Answer

As I'm very new to creating any scraper using multiprocessing, I expected to have a real-life script in order to understand the logic very clearly. The site used within the script has some bot-protection mechanism. However, I've found a very similar webpage to apply multiprocessing within it.

import requests
from multiprocessing import Pool
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://srar.com/roster/index.php?agent_search={}"

def get_links(link):
    # Collect the profile links from the search result pages for letters a-d.
    completelinks = []
    for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
        res = requests.get(link.format(ilink))
        soup = BeautifulSoup(res.text,'lxml')
        for items in soup.select("table.border tr"):
            if not items.select("td a[href^='index.php?agent']"):continue
            data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
            completelinks.extend(data)
    return completelinks

def get_info(nlink):
    # Scrape the table rows from a single profile page.
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for tr in sauce.select("table[style$='1px;'] tr"):
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    allurls = get_links(url)
    # The pool size is what limits the number of concurrent requests.
    # map() blocks until all tasks finish, and the with-block terminates the
    # pool on exit, so no separate join() call is needed here.
    with Pool(10) as p:
        p.map(get_info,allurls)
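
The same cap on in-flight requests can also be kept with threads instead of processes. Below is a minimal sketch, not taken from the original post, that applies concurrent.futures.ThreadPoolExecutor to the site from the question; the max_workers value and the collect_links helper name are illustrative choices, not something prescribed by the original script.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def collect_links(link):
    # Gather the detail-page links from all 29 listing pages.
    itemlinks = []
    for mlink in [link.format(page) for page in range(1,30)]:
        soup = BeautifulSoup(requests.get(mlink).text,"lxml")
        itemlinks.extend(urljoin(link,container.select_one("h2.name a").get("href"))
                         for container in soup.select(".profile"))
    return itemlinks

def fetch_info(nlink):
    # Scrape the happy hours and featured specials from a single detail page.
    soup = BeautifulSoup(requests.get(nlink).text,"lxml")
    for container in soup.select(".specials"):
        hours = container.select_one("h3").text if container.select_one("h3") else ""
        fspecial = ' '.join(item.text for item in container.select(".special"))
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    links = collect_links(url)
    # max_workers caps how many requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(fetch_info,links)

Since the work is I/O-bound, a thread pool limits concurrency in the same way the process pool above does, without the overhead of spawning extra processes.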
