asyncio web scraping 101: fetching multiple urls with aiohttp

Question

In an earlier question, one of the authors of aiohttp kindly suggested a way to fetch multiple URLs with aiohttp, using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the the_results

However, when one of the session.get(url) requests fails (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

I looked for ways to insert checks on the result of session.get(url), for instance a place for a try ... except ..., or for an if response.status != 200:, but I just don't understand how to work with async with, await and the various objects.

Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people will want to test with asyncio is getting multiple resources concurrently.

Goal

The goal is that we can inspect the_results and quickly see either:

  • this url failed (and why: status code, maybe exception name), or
  • this url worked, and here is a useful response object

Recommended answer

I would use gather instead of wait. gather can return exceptions as objects instead of raising them, so you can then check whether each result is an instance of some exception.
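To make the difference concrete, here is a minimal offline sketch contrasting the two calls. It uses only the standard library; fake_fetch and the example.* urls are hypothetical stand-ins for real requests, not part of aiohttp or of the original code.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request, so the sketch runs offline."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if 'bad' in url:
        raise ConnectionError('cannot resolve ' + url)
    return 'body of ' + url


async def with_wait(urls):
    # wait() returns two sets of tasks (done, pending), not the results.
    # To keep the input order we walk our own task list and ask each task
    # for its exception or its result.
    tasks = [asyncio.ensure_future(fake_fetch(url)) for url in urls]
    await asyncio.wait(tasks)
    return [task.exception() or task.result() for task in tasks]


async def with_gather(urls):
    # gather() returns a list in the order of its inputs; with
    # return_exceptions=True a failure appears as an exception object.
    return await asyncio.gather(*[fake_fetch(url) for url in urls],
                                return_exceptions=True)


if __name__ == '__main__':
    urls = ['http://bad.example', 'http://good.example']
    for url, result in zip(urls, asyncio.run(with_gather(urls))):
        print(url, 'ERR' if isinstance(result, Exception) else 'OK')
```

Both functions end up with the same information, but gather hands it back directly and in order. The aiohttp version of the same idea: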

import aiohttp
import asyncio

async def fetch(session, url):
    # the session-level timeout configured in main() applies to this request
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(session, urls):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is False, which would raise
    )

    # for testing purposes only
    # gather returns results in the order of the coroutines passed in
    for url, result in zip(urls, results):
        print('{}: {}'.format(url, 'ERR' if isinstance(result, Exception) else 'OK'))
    return results

async def main(urls):
    # aiohttp 3.x: ClientTimeout replaces the old aiohttp.Timeout context
    # manager, and the session has to be opened with "async with"
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await fetch_all(session, urls)

if __name__ == '__main__':
    # the first url cannot be resolved
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    the_results = asyncio.run(main(urls))

Testing:

$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
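The question also asks where a try ... except ... and a status check fit around async with. A minimal sketch of one possible placement follows. To keep it runnable offline, FakeSession, FakeRequest and FakeResponse are hypothetical stand-ins for the aiohttp objects and belong to no library; the (url, ok, detail) tuple is likewise only an illustrative result shape.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for an aiohttp response: a status and text()."""

    def __init__(self, status, body):
        self.status = status
        self._body = body

    async def text(self):
        return self._body


class FakeRequest:
    """Hypothetical async context manager, playing the role of session.get(url)."""

    def __init__(self, url, pages):
        self._url = url
        self._pages = pages

    async def __aenter__(self):
        if self._url not in self._pages:
            # simulate a host that cannot be reached
            raise OSError('cannot connect to ' + self._url)
        return FakeResponse(*self._pages[self._url])

    async def __aexit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


class FakeSession:
    """Hypothetical offline replacement for aiohttp.ClientSession."""

    def __init__(self, pages):
        self._pages = pages  # maps url -> (status, body)

    def get(self, url):
        return FakeRequest(url, self._pages)


async def fetch(session, url):
    """Return (url, ok, detail): detail is the body on success, else the reason."""
    try:
        async with session.get(url) as response:
            if response.status != 200:
                return url, False, 'HTTP {}'.format(response.status)
            return url, True, await response.text()
    except (OSError, asyncio.TimeoutError) as exc:
        # with aiohttp, aiohttp.ClientError is the usual class to catch here
        return url, False, type(exc).__name__


async def fetch_all(session, urls):
    # fetch() handles its own errors, so a plain gather is enough
    return await asyncio.gather(*[fetch(session, url) for url in urls])


if __name__ == '__main__':
    session = FakeSession({'http://ok.example': (200, 'hello'),
                           'http://missing.example': (404, 'not found')})
    urls = ['http://unreachable.example', 'http://ok.example',
            'http://missing.example']
    for url, ok, detail in asyncio.run(fetch_all(session, urls)):
        print('{}: {} ({})'.format(url, 'OK' if ok else 'ERR', detail))
```

The try wraps the whole async with block because a connection failure surfaces when the block is entered, while the status check sits inside it, where the response object is available. Each entry of the returned list then says whether its url worked and, if not, why.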
