How to check if a task is already in a Python Queue?

Problem description

I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, extract its links, and put them into a queue. When a thread has finished processing a page, it grabs the next one from the queue. I keep a list of the pages I've already visited and use it to filter the links I add to the queue, but if there is more than one thread and they find the same link on different pages, they put duplicate links into the queue. So how can I find out whether a URL is already in the queue, to avoid putting it there again?

Recommended answer

If you don't care about the order in which items are processed, I'd try a subclass of Queue that uses a set internally:

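from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    # Queue calls these hooks while holding its internal lock.
    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()  # pending items; a set drops duplicates

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()  # removes an arbitrary item, so no FIFO order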
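
A minimal usage sketch of the set-backed queue, assuming Python 3 (`from queue import Queue`). The URLs and variable names are made up for illustration. It shows that a duplicate is collapsed while the item is still pending, but that the same item is accepted again once it has been taken out:

```python
from queue import Queue

class SetQueue(Queue):
    """Repeated here so the snippet runs on its own."""

    def _init(self, maxsize):
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()

q = SetQueue()
q.put("http://example.com/a")
q.put("http://example.com/a")      # duplicate while pending: absorbed by the set
pending_after_duplicates = q.qsize()

url = q.get()                      # the URL leaves the pending set here
q.put(url)                         # so the same URL is accepted a second time
pending_after_requeue = q.qsize()

print(pending_after_duplicates, pending_after_requeue)
```

Both sizes come out as 1: the first because the duplicate was dropped, the second because the already-fetched URL was queued again.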

As Paul McGuire pointed out, this would allow adding a duplicate item after it has been removed from the "to-be-processed" set and not yet added to the "processed" set. To solve this, you can store both sets in the Queue instance. But since the larger set of all items ever added is then what you check against, the pending items no longer need to be a set, and you can just as well go back to the regular queue storage, which keeps requests in FIFO order.

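from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        Queue._init(self, maxsize)  # keep the normal FIFO storage
        self.all_items = set()      # every item ever accepted, pending or done

    def _put(self, item):
        # Runs under the queue's lock, so check-then-add is atomic.
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)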
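
A short usage sketch of this second version, again assuming Python 3; the paths are placeholder values. It checks that items come out in insertion order and that an item stays rejected even after it has been fetched:

```python
from queue import Queue

class SetQueue(Queue):
    """Repeated here so the snippet runs on its own."""

    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)

q = SetQueue()
for url in ["/a", "/b", "/a", "/c", "/b"]:
    q.put(url)                 # the second "/a" and "/b" are silently dropped

first = q.get()                # "/a": FIFO order is preserved
q.put(first)                   # already seen, so it is not queued again

remaining = []
while not q.empty():
    remaining.append(q.get())

print(first, remaining)
```

Note that `all_items` only ever grows, so memory use is proportional to the number of distinct items seen over the queue's lifetime.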

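The advantage of this, as opposed to keeping a separate set outside the queue, is that Queue's methods are thread-safe: put() calls _put() while holding the queue's internal lock, so the membership check and the insert happen together and you don't need additional locking around the set.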
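
To illustrate the thread-safety point, here is a rough sketch in which several threads report the same links at once. The function name `report_links`, the thread count, and the link names are all invented for the example; no extra lock is used around the set.

```python
import threading
from queue import Queue, Empty

class SetQueue(Queue):
    """Repeated here so the snippet runs on its own."""

    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)

def report_links(q, links):
    # Hypothetical worker: every thread "discovers" the same links.
    for link in links:
        q.put(link)

q = SetQueue()
links = ["/page%d" % i for i in range(50)]
threads = [threading.Thread(target=report_links, args=(q, links))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue from the main thread.
drained = []
while True:
    try:
        drained.append(q.get_nowait())
    except Empty:
        break

# In CPython, put() counts every call, including the dropped duplicates.
puts_counted = q.unfinished_tasks
print(len(drained), puts_counted)
```

Each link ends up in the queue exactly once. One caveat, based on how CPython's `queue.Queue` is written: `put()` increments `unfinished_tasks` even when `_put()` drops a duplicate, so code that relies on `task_done()` and `Queue.join()` would need to account for the dropped items, otherwise `join()` may block forever.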
