Script performs very slowly even when it runs asynchronously


Problem description


I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic within the following script the way it is usually applied in scrapy.

However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. Therefore, it is very slow and doesn't serve the purpose.

I know I can get around this by defining all the next-page links within the link variable. But am I not already doing the task the right way with my existing script?

Within the script, what the processing_docs() function does is collect all the links of the different posts and pass the refined links to the fetch_again() function to fetch the title from its target page. There is a logic applied within the processing_docs() function which collects the next_page link and supplies it to the fetch() function to repeat the same. This next_page call is making the script slower, whereas we usually do the same in scrapy and get the expected performance.

My question is: how can I achieve the same thing while keeping the existing logic intact?

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        # each title fetch is awaited before the next one starts
        await fetch_again(session, title)

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        # the next page is only fetched after all the titles above
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

Solution

The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:

for title in titles:
    await fetch_again(session, title)

This part means that each new fetch_again will be started only after the previous one was awaited (finished). If you do things this way, then yes, there's no difference from using a synchronous approach.

To unlock the full power of asyncio, start multiple fetches concurrently using asyncio.gather:

await asyncio.gather(*[
    fetch_again(session, title)
    for title in titles
])

You'll see a significant speedup.
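The difference is easy to demonstrate. Below is a minimal, self-contained sketch (not from the original answer) that uses asyncio.sleep as a stand-in for network I/O:

import asyncio
import time

async def fake_fetch(n):
    await asyncio.sleep(1)  # stands in for a ~1 second network request
    return n

async def sequential():
    # each "request" is awaited before the next one starts: ~10 seconds total
    return [await fake_fetch(n) for n in range(10)]

async def concurrent():
    # all ten "requests" run at the same time: ~1 second total
    return await asyncio.gather(*(fake_fetch(n) for n in range(10)))

loop = asyncio.get_event_loop()
for coro in (sequential(), concurrent()):
    start = time.perf_counter()
    loop.run_until_complete(coro)
    print(f"{time.perf_counter() - start:.1f} s")

The sequential version takes about ten seconds, the gathered one about one second, even though both await the same coroutines.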


You can go even further and start the fetch for the next page concurrently with the fetch_again calls for the titles:

async def processing_docs(session, html):
    coros = []

    tree = fromstring(html)

    # titles:
    titles = [
        urljoin(link, title.attrib['href'])
        for title in tree.cssselect(".summary .question-hyperlink")
    ]

    for title in titles:
        coros.append(
            fetch_again(session, title)
        )

    # next_page:
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])

        coros.append(
            fetch(page_link)
        )

    # await:
    await asyncio.gather(*coros)
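A side note, not from the original answer: asyncio.gather propagates the first exception raised by any of the gathered coroutines, so one bad page would abort the whole crawl. If that's undesirable, return_exceptions=True collects failures as values instead:

# inside processing_docs, replacing the plain gather call:
results = await asyncio.gather(*coros, return_exceptions=True)
for result in results:
    if isinstance(result, Exception):
        print(f"a request failed: {result!r}")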


Important note

While this approach lets you do things much faster, you may want to limit the number of concurrent requests at a time to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose:

semaphore = asyncio.Semaphore(10)  # allow at most 10 fetches in flight at once

async def fetch(url):
    async with semaphore:  # blocks here while 10 other fetches are running
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result
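One subtlety worth flagging (my observation, not part of the original answer): in the snippet above the permit is held for the whole fetch, including the recursive processing_docs call, and fetch_again is not limited at all. A variant that guards only the individual HTTP requests, assuming the same imports and processing_docs as above, might look like this:

semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with semaphore:  # permit held only while the request itself runs
            async with session.get(url) as response:
                text = await response.text()
        # recursive page processing happens outside the semaphore, so a long
        # chain of next-page fetches cannot hold all the permits at once
        return await processing_docs(session, text)

async def fetch_again(session, url):
    async with semaphore:  # title requests now count against the same limit
        async with session.get(url) as response:
            text = await response.text()
    tree = fromstring(text)
    print(tree.cssselect("h1[itemprop='name'] a")[0].text)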
