Scraping content using pyppeteer in association with asyncio

Problem description

I've written a script in Python, using pyppeteer in combination with asyncio, to scrape the links of different posts from its landing page and eventually get the title of each post by following the URL leading to its inner page. The content I'm parsing here isn't dynamic; however, I made use of pyppeteer and asyncio to see how efficiently they perform asynchronously.

The following script runs fine for a few moments but then encounters an error:

File "C:\Users\asyncio\tasks.py", line 526, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required

This is what I've written so far:

import asyncio
from pyppeteer import launch

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    tasks = [await browse_all_links(link, page) for link in linkstorage]
    results = await asyncio.gather(*tasks)
    return results

async def browse_all_links(link, page):
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink','(e => e.innerText)')
    print(title)

async def main(url):
    browser = await launch(headless=True,autoClose=False)
    page = await browser.newPage()
    await fetch(page,url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(main(link))
    loop.run_until_complete(future)
    loop.close()

My question: how can I get rid of that error and have the script run asynchronously?

Solution

The problem is in the following lines:

tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)

The intention is for tasks to be a list of awaitable objects, such as coroutine objects or futures. The list is to be passed to gather, so that the awaitables can run in parallel until they all complete. However, the list comprehension contains an await, which means that it:

  • executes each browse_all_links to completion in series rather than in parallel;
  • places the return values of the browse_all_links invocations into the list.

Since browse_all_links doesn't return a value, you are passing a list of None objects to asyncio.gather, which complains that it didn't get an awaitable object.
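
The distinction is easy to demonstrate in isolation. Here is a minimal, pyppeteer-free sketch (the work coroutine is hypothetical, purely for illustration): calling a coroutine function without await produces a coroutine object that gather can schedule concurrently, while awaiting it inside the comprehension runs it to completion on the spot and collects its return value (None) instead:

import asyncio

async def work(n):
    await asyncio.sleep(0.1)
    print('done', n)
    # no return statement, so the value of `await work(n)` is None

async def main():
    # Wrong: awaits each coroutine inside the comprehension, so they run
    # one after another and `tasks` becomes a list of None values;
    # gather() then raises "An asyncio.Future, a coroutine or an
    # awaitable is required".
    # tasks = [await work(n) for n in range(3)]

    # Right: collect the coroutine objects themselves and let gather()
    # run them concurrently.
    tasks = [work(n) for n in range(3)]
    await asyncio.gather(*tasks)

asyncio.get_event_loop().run_until_complete(main())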

To resolve the issue, just drop the await from the list comprehension.
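
For reference, here is a minimal sketch of the corrected fetch with that single change applied (the rest of the script is unchanged):

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    # No await here: build a list of coroutine objects for gather() to run.
    tasks = [browse_all_links(link, page) for link in linkstorage]
    results = await asyncio.gather(*tasks)
    return results

Note that this sketch only addresses the TypeError: all of the tasks still share the single page object created in main, so their page.goto calls will contend for the same tab; giving each task its own page is one way around that.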
