My script encounters an error when it is supposed to run asynchronously

Problem description

I've written a script in Python using asyncio in association with the aiohttp library to asynchronously parse the names out of the pop-up boxes that open when clicking the contact info buttons of the different agencies listed within a table on this website. The webpage displays the tabular content across 513 pages.

I encountered the error "too many file descriptors in select()" when I tried asyncio.get_event_loop(), but in this thread I saw a suggestion to use asyncio.ProactorEventLoop() to avoid such an error, so I used the latter. However, even after complying with the suggestion, the script collects the names from only a few pages before it throws the following error. How can I fix this?

raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]

This is my attempt so far:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception: name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

In short, what the process_docs() function does is collect the data-id numbers from each page and plug them into the https://www.tursab.org.tr/en/displayAcenta?AID={} link in order to collect the names from the pop-up boxes. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.

Btw, if I change the highest number used in the links variable to 20 or 30 or so, it runs smoothly.

Recommended answer

async def get_links(url):
    async with asyncio.Semaphore(10):

You can't do that: it means a new semaphore instance is created on each function call, while you need a single semaphore instance shared by all requests. Change your code this way:

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...


async def fetch_again(link):
    async with sem:
        # ...
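
As an illustration (a minimal, self-contained sketch that is not part of the original answer; worker and main are made-up names), a single module-level semaphore caps how many coroutines can sit inside its async with block at any moment, no matter how many tasks have been started:

import asyncio
import time

sem = asyncio.Semaphore(2)  # one shared instance for every task

async def worker(n):
    async with sem:  # at most two workers hold the semaphore at any moment
        print(f"worker {n} entered after {time.monotonic() - start:.1f}s")
        await asyncio.sleep(1)

async def main():
    await asyncio.gather(*(worker(n) for n in range(6)))

if __name__ == '__main__':
    start = time.monotonic()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

With Semaphore(2) and six workers, the printed times come out in batches of two, roughly one second apart.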

You can also return to the default loop once you're using the semaphore correctly:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)

Finally, you should alter both get_links(url) and fetch_again(link) to do the parsing outside of the semaphore, so it is released as soon as possible, before the semaphore is needed again inside process_docs(text).

Final code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
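    # the "async with sem" block has already exited here, so the semaphore is
    # released while the page is being parsed below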
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = "o"
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
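
A hedged side note, not part of the original answer: on Python 3.7+ the same entry point can be written with asyncio.run(). In that case it is safer to create the semaphore while the loop is already running, for example inside a small main() coroutine. The sketch below assumes it replaces the module-level sem and the __main__ block of the final code above.

import asyncio

async def main():
    global sem
    sem = asyncio.Semaphore(10)  # created while the event loop is running
    await asyncio.gather(*(get_links(link) for link in links))

if __name__ == '__main__':
    asyncio.run(main())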
