Python threading or multiprocessing for web-crawler?

Question

I've made a simple web crawler with Python. So far it maintains a set of URLs that should be visited and a set of URLs that have already been visited. While parsing a page it adds all the links on that page to the should-be-visited set and the page's URL to the already-visited set, and it keeps going while the length of should_be_visited is > 0. So far it does everything in one thread.

Now I want to add parallelism to this application, so I need the same kind of sets of links and a few threads/processes, where each one will pop a URL from should_be_visited and update already_visited. I'm really lost between threading and multiprocessing: which should I use, and do I need some Pools or Queues?
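For reference, the single-threaded loop described above can be sketched like this. `get_links` is a hypothetical stand-in for the real fetch-and-parse step, and the demo uses an in-memory link graph instead of real HTTP requests:

```python
def crawl(start_url, get_links):
    """Single-threaded crawl loop: pop from should_be_visited until it is empty."""
    should_be_visited = {start_url}
    already_visited = set()
    while should_be_visited:
        url = should_be_visited.pop()
        already_visited.add(url)
        for link in get_links(url):
            if link not in already_visited:
                should_be_visited.add(link)
    return already_visited

# Demo with a fake in-memory "web" instead of real HTTP fetching.
fake_web = {
    "http://a": ["http://b", "http://c"],
    "http://b": ["http://a", "http://c"],
    "http://c": [],
}
visited = crawl("http://a", lambda url: fake_web.get(url, []))
```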

Answer

The rule of thumb when deciding whether to use threads in Python is to ask whether the task the threads will be doing is CPU-intensive or I/O-intensive. If the answer is I/O-intensive, then you can go with threads.

Because of the GIL, the Python interpreter will run only one thread at a time. If a thread is doing some I/O, it will block waiting for the data to become available (from a network connection or the disk, for example), and in the meantime the interpreter will context-switch to another thread. On the other hand, if a thread is doing a CPU-intensive task, the other threads will have to wait until the interpreter decides to run them.
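A small demonstration of that behaviour, using `time.sleep` to stand in for blocking network I/O (sleeping, like real I/O, releases the GIL, so the waits overlap):

```python
import threading
import time

def fake_request(duration):
    time.sleep(duration)  # stands in for blocking network I/O; releases the GIL

start = time.perf_counter()
threads = [threading.Thread(target=fake_request, args=(0.2,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# The five 0.2 s waits overlap, so the total is close to 0.2 s rather than 1 s.
```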

Web crawling is mostly an I/O-oriented task: you need to make an HTTP connection, send a request, and wait for the response. Yes, after you get the response you need to spend some CPU to parse it, but besides that it is mostly I/O work. So, I believe, threads are a suitable choice in this case.
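One possible wiring for the threaded version, sketched under the assumption that a `get_links` helper does the actual fetch-and-parse work: a `queue.Queue` of URLs to visit, a shared visited set guarded by a lock, and a pool of worker threads. The demo again uses an in-memory link graph rather than real HTTP:

```python
import queue
import threading

def crawl_threaded(start_url, get_links, num_workers=4):
    """Threaded crawl: workers pop URLs from a queue and share a visited set."""
    to_visit = queue.Queue()
    to_visit.put(start_url)
    visited = set()
    lock = threading.Lock()

    def worker():
        while True:
            url = to_visit.get()
            try:
                with lock:
                    if url in visited:
                        continue  # another worker already handled this URL
                    visited.add(url)
                for link in get_links(url):  # the I/O-bound part
                    to_visit.put(link)
            finally:
                to_visit.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    to_visit.join()  # blocks until every queued URL has been processed
    return visited

fake_web = {
    "http://a": ["http://b", "http://c"],
    "http://b": ["http://a", "http://c"],
    "http://c": [],
}
result = crawl_threaded("http://a", lambda url: fake_web.get(url, []))
```

The daemon workers simply die with the process once `join()` returns; a real crawler would also want error handling and politeness delays between requests to the same host.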

(And of course, respect the robots.txt, and don't storm the servers with too many requests :-)
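The standard library's `urllib.robotparser` can handle the robots.txt part. In this sketch the rules are parsed from an inline list of lines so no network access is needed, and "MyCrawler" is a placeholder user-agent name:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real crawler you would instead do:
#     rp.set_url("https://example.com/robots.txt"); rp.read()
# Parsing inline rules keeps this example self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch("MyCrawler", "https://example.com/public/page")
blocked = rp.can_fetch("MyCrawler", "https://example.com/private/secret")
```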
