How to run tasks concurrently in asyncio?

Problem description

I'm trying to learn how to run tasks concurrently using Python's asyncio module. In the following code, I've got a mock "web crawler" for an example. Basically, I am trying to make it where there are a max of two active fetch() requests happening at any given time, and I want process() to be called during the sleep() period.

import asyncio

class Crawler():

    urlq = ['http://www.google.com', 'http://www.yahoo.com', 
            'http://www.cnn.com', 'http://www.gamespot.com', 
            'http://www.facebook.com', 'http://www.evergreen.edu']

    htmlq = []
    MAX_ACTIVE_FETCHES = 2
    active_fetches = 0

    def __init__(self):
        pass

    async def fetch(self, url):
        self.active_fetches += 1
        print("Fetching URL: " + url);
        await(asyncio.sleep(2))
        self.active_fetches -= 1
        self.htmlq.append(url)

    async def crawl(self):
        while self.active_fetches < self.MAX_ACTIVE_FETCHES:
            if self.urlq:
                url = self.urlq.pop()
                task = asyncio.create_task(self.fetch(url))
                await task
            else:
                print("URL queue empty")
                break;

    def process(self, page):
        print("processed page: " + page)

# main loop

c = Crawler()
while(c.urlq):
    asyncio.run(c.crawl())
    while c.htmlq:
        page = c.htmlq.pop()
        c.process(page)

However, the code above downloads the URLs one by one (not two at a time concurrently) and doesn't do any "processing" until after all URLs have been fetched. How can I make the fetch() tasks run concurrently, and make it so that process() is called in between during sleep()?

Recommended answer

Your crawl method is waiting after each individual task; you should change it to this:

async def crawl(self):
    tasks = []
    while self.active_fetches < self.MAX_ACTIVE_FETCHES:
        if self.urlq:
            url = self.urlq.pop()
            tasks.append(asyncio.create_task(self.fetch(url)))
        else:
            print("URL queue empty")
            break
    await asyncio.gather(*tasks)
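
Another common way to cap concurrency, not shown in the original answer, is an asyncio.Semaphore: every fetch acquires the semaphore before downloading, so at most MAX_ACTIVE_FETCHES fetches run at any moment while gather still drives all of them concurrently. A minimal, self-contained sketch reusing the names from the question:

import asyncio

MAX_ACTIVE_FETCHES = 2  # same cap as in the question

async def fetch(sem, url):
    # at most MAX_ACTIVE_FETCHES coroutines can hold the semaphore at once
    async with sem:
        print("Fetching URL: " + url)
        await asyncio.sleep(2)  # stand-in for the real download
        return url

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_ACTIVE_FETCHES)
    tasks = [asyncio.create_task(fetch(sem, url)) for url in urls]
    return await asyncio.gather(*tasks)

pages = asyncio.run(crawl(['http://www.google.com', 'http://www.yahoo.com']))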

EDIT: Here's a cleaner version with comments that fetches and processes all at the same time, while preserving the basic ability to put a cap on the maximum number of fetchers.

import asyncio

class Crawler:

    def __init__(self, urls, max_workers=2):
        self.urls = urls
        # queue of URLs to fetch; concurrency is capped by the number of workers
        self.fetching = asyncio.Queue()
        self.max_workers = max_workers

    async def crawl(self):
        # DON'T await here; start consuming things out of the queue, and
        # meanwhile execution of this function continues. We'll start
        # max_workers coroutines, each of which fetches a URL and then
        # processes the resulting page.
        all_the_coros = asyncio.gather(
            *[self._worker(i) for i in range(self.max_workers)])

        # place all URLs on the queue
        for url in self.urls:
            await self.fetching.put(url)

        # now put a bunch of `None`'s in the queue as signals to the workers
        # that there are no more items in the queue.
        for _ in range(self.max_workers):
            await self.fetching.put(None)

        # now make sure everything is done
        await all_the_coros

    async def _worker(self, i):
        while True:
            url = await self.fetching.get()
            if url is None:
                # this coroutine is done; simply return to exit
                return

            print(f'Fetch worker {i} is fetching a URL: {url}')
            page = await self.fetch(url)
            self.process(page)

    async def fetch(self, url):
        print("Fetching URL: " + url);
        await asyncio.sleep(2)
        return f"the contents of {url}"

    def process(self, page):
        print("processed page: " + page)


# main loop
c = Crawler(['http://www.google.com', 'http://www.yahoo.com', 
             'http://www.cnn.com', 'http://www.gamespot.com', 
             'http://www.facebook.com', 'http://www.evergreen.edu'])
asyncio.run(c.crawl())
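
To turn the mock crawler into a real one, the fetch method could be replaced with an actual HTTP request. The sketch below is an assumption on my part: it uses the third-party aiohttp library, which is not part of the question or the answer and must be installed separately (pip install aiohttp).

import aiohttp  # assumption: third-party library, not used in the original answer

class HttpCrawler(Crawler):

    async def fetch(self, url):
        print("Fetching URL: " + url)
        # a shared ClientSession created once in crawl() would be more
        # efficient; one session per request keeps the sketch short
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()

Running asyncio.run(HttpCrawler([...]).crawl()) would then download and process the real pages with the same worker-based concurrency cap.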
