Gevent pool with nested web requests

Question

I am trying to set up a pool with at most 10 concurrent downloads. The function should download a base URL, then parse all URLs on that page and download each of them, but the OVERALL number of concurrent downloads should not exceed 10.

from gevent import monkey, pool
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests
from lxml import etree

urls = [
    'http://www.google.com',
    'http://www.yandex.ru',
    'http://www.python.org',
    'http://stackoverflow.com',
    # ... another 100 urls
    ]

LINKS_ON_PAGE = []
POOL = pool.Pool(10)

def parse_urls(page):
    html = etree.HTML(page)
    links = []  # stays empty if the page did not yield a tree
    # Compare with None: an element's truth value depends on whether it has children
    if html is not None:
        links = [link for link in html.xpath("//a/@href") if 'http' in link]
    # Download each url that appears in the main URL
    for link in links:
        data = requests.get(link)
        LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))

def get_base_urls(url):
    # Download the main URL
    data = requests.get(url)
    parse_urls(data.content)

How can I make this run concurrently while keeping one global pool limit for ALL web requests?

Recommended answer

I think the following should get you what you want. I'm using BeautifulSoup in my example instead of the link-extraction code you had.

from gevent import monkey, pool
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests
from bs4 import BeautifulSoup

jobs = []
links = []
p = pool.Pool(10)  # at most 10 greenlets run at once

urls = [
    'http://www.google.com',
    # ... another 100 urls
]

def get_links(url):
    # Download one page and collect its <a> tags
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        links.extend(soup.find_all('a'))

for url in urls:
    jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
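
The answer's code collects the links from each base page but does not download them. As one possible way to cover the nested downloads too, here is a minimal sketch that shares a single limit across base and nested requests by using a `gevent.lock.BoundedSemaphore` instead of the pool. The names `crawl`, `fetch`, `extract_links`, `limited_fetch` and `download_page` are hypothetical and not part of the original answer; `fetch` and `extract_links` are passed in so the sketch does not depend on a particular HTTP or parsing library.

```python
import gevent
from gevent.lock import BoundedSemaphore

def crawl(urls, fetch, extract_links, limit=10):
    """Download every URL in `urls` plus every link found on those pages,
    never running more than `limit` fetch() calls at the same time.

    `fetch(url)` must return the page body; `extract_links(body)` must
    return an iterable of URLs. Both are assumptions of this sketch.
    """
    slots = BoundedSemaphore(limit)  # one shared limit for base and nested requests
    results = []                     # (url, body length) for every finished download

    def limited_fetch(url):
        with slots:                  # waits cooperatively while `limit` fetches are running
            body = fetch(url)
        results.append((url, len(body)))
        return body

    def download_page(url):
        body = limited_fetch(url)
        # The slot has been released by now, so a parent greenlet never
        # holds a slot while it waits for its children.
        children = [gevent.spawn(limited_fetch, link) for link in extract_links(body)]
        gevent.joinall(children)

    gevent.joinall([gevent.spawn(download_page, url) for url in urls])
    return results
```

With requests, `fetch` could be something like `lambda url: requests.get(url).content`, provided `monkey.patch_all()` has been called before requests is imported so that the downloads actually yield to other greenlets. The semaphore is used here rather than spawning the nested downloads into the same `Pool(10)` because `Pool.spawn` waits for a free slot when the pool is full, so greenlets that spawn into their own full pool may end up waiting on each other.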
 
