Python web scraping: difference between sleep and request(page, timeout=x)


Question

When scraping multiple websites in a loop, I notice a rather large difference in speed between the following two set-ups:

sleep(10)
response = requests.get(url)

response = requests.get(url, timeout=10)

That is, the version with timeout is much faster.

Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before the next page is requested, but this is not the case.


  1. Why is there such a difference in speed?

  2. Why is the scraping duration per page less than 10 seconds?

I now use multiprocessing, but I seem to remember that the above holds for non-multiprocessing as well.

Recommended answer

time.sleep stops your script from running for a fixed number of seconds, whereas timeout is the maximum time to wait for the server to respond. (In requests it limits how long to wait for the server to send data, not the total download time; if the limit is exceeded, an exception is raised.) If the response arrives before the timeout expires, the call returns immediately and the remaining time is not spent waiting. So with timeout a page can take far less than 10 seconds.

time.sleep is different: it pauses your script completely until the sleep is over, and only then is the request sent, which takes additional time. So the time.sleep version takes more than 10 seconds per page every time within a single process. With multiprocessing, several workers sleep at the same time, so the average time per page over the whole run can still come out below 10 seconds.
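To make the difference concrete, the two set-ups can be timed side by side. This is only a minimal sketch: fake_get is a hypothetical stand-in for requests.get that simulates 0.05 s of network latency, and the 10-second values are scaled down to 0.2 s so the script finishes quickly. The helper names (timed, with_sleep, with_timeout) are illustrative, not part of any library.

```python
import time

def fake_get(url, timeout=None):
    """Hypothetical stand-in for requests.get: pretends the server answers in 0.05 s."""
    time.sleep(0.05)
    return f"response for {url}"

def timed(func):
    """Run func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

def with_sleep(url, delay):
    # Set-up 1: always wait `delay` seconds, then make the request.
    time.sleep(delay)
    return fake_get(url)

def with_timeout(url, limit):
    # Set-up 2: no fixed wait; `limit` only caps how long we would wait.
    return fake_get(url, timeout=limit)

_, sleep_elapsed = timed(lambda: with_sleep("https://example.com", 0.2))
_, timeout_elapsed = timed(lambda: with_timeout("https://example.com", 0.2))
print(f"sleep set-up:   {sleep_elapsed:.2f} s")
print(f"timeout set-up: {timeout_elapsed:.2f} s")
```

The sleep set-up pays the full delay plus the request time, while the timeout set-up pays only the request time, which matches the speed difference described in the question.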

They have very different uses. For your case, you should time each request and, if it finishes in under 10 seconds, make the program wait for the remainder.
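One way to sketch such a timer is shown here. The function name fetch_all and its parameters are hypothetical, and fetch is assumed to be any callable that takes a URL, for example a small wrapper around requests.get(url, timeout=10).

```python
import time

def fetch_all(urls, fetch, min_interval=10.0):
    """Call fetch(url) for each URL, spending at least min_interval seconds per page."""
    results = []
    for url in urls:
        start = time.monotonic()
        results.append(fetch(url))
        # Sleep only for whatever is left of the interval, if anything.
        remaining = min_interval - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    return results
```

With requests installed, this could be called as fetch_all(urls, lambda u: requests.get(u, timeout=10)). A fast response is then padded up to 10 seconds, while a response that takes longer than 10 seconds is followed by no extra wait. Exceptions raised by fetch, such as requests.exceptions.Timeout, are not caught in this sketch and would stop the loop.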
