Python web scraping: difference between sleep and request(page, timeout=x)
Question
When scraping multiple websites in a loop, I notice there is a rather large difference in speed between
sleep(10)
response = requests.get(url)
and
response = requests.get(url, timeout=10)
That is, the timeout variant is much faster.
Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.
- Why is there this difference in speed?
- Why is the scraping duration per page less than 10 seconds?
I now use multiprocessing, but I believe the above holds for non-multiprocessing code as well.
Answer
time.sleep stops your script from running for a certain number of seconds, while timeout is the maximum time to wait for retrieving the URL. If the data is retrieved before the timeout is up, the remaining time is skipped. So a request with timeout can take less than 10 seconds.
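The difference can be seen without any network access. The sketch below uses a `fake_request` helper (an illustrative stand-in for `requests.get`, not part of any library) that simply sleeps for the server's response time; `timeout` only caps how long the real call may take, it never adds waiting, whereas `sleep` always adds a fixed pause on top:

```python
import time

def timed(fn):
    """Return how many seconds fn() takes to run."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start

def fake_request(duration=0.2):
    # Stand-in for requests.get: returns as soon as the "server"
    # responds; a timeout argument would only cap this duration.
    time.sleep(duration)

# timeout pattern: total time is just the response time
t_timeout = timed(lambda: fake_request(0.2))

# sleep pattern: fixed pause plus the response time, every iteration
t_sleep = timed(lambda: (time.sleep(0.5), fake_request(0.2)))
```

With a 0.2 s response, the timeout pattern takes roughly 0.2 s, while the sleep pattern takes roughly 0.7 s; the same ratio explains the gap you see with 10 seconds.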
time.sleep is different: it pauses your script completely until the sleep finishes, and only then runs your request, which takes another few seconds. So the time.sleep version will take more than 10 seconds every time.
They have very different uses, but for your case you should use a timer: if the request finishes before 10 seconds are up, make the program wait out the remainder.
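A minimal sketch of that timer idea, measuring the elapsed time around the request and sleeping only for whatever is left of the interval (`rate_limited_get` and its parameters are illustrative names, not part of `requests`):

```python
import time

def rate_limited_get(fetch, url, min_interval=10.0):
    """Call fetch(url), then sleep until at least min_interval seconds
    have passed since the fetch started, so each iteration takes the
    full interval even when the request finishes early."""
    start = time.monotonic()
    result = fetch(url)
    remaining = min_interval - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)
    return result
```

In the scraping loop you would pass `requests.get` (or a `functools.partial(requests.get, timeout=10)`) as `fetch`, so a slow request is still capped by the timeout while a fast one no longer shortens the interval between pages.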