pyspider - Why do large bursts of concurrent fetches sometimes occur?


Question

Normally, fetches run roughly at the frequency set by rate.
But sometimes (while the project still has unfinished tasks and is running), several selects appear in a row, followed by several concurrent fetch requests. This easily trips the site's anti-crawler mechanism.
This project was stopped once before after being banned. Could the previously failed tasks be getting scheduled concurrently?

Configuration: rate/burst is set to 0.4/1.0. burst cannot be set any lower (otherwise nothing executes).

Log:

[I 160521 16:05:07 scheduler:771] select Project1:723240e17ce580fdb9887af195a8d138 (url)
[I 160521 16:05:08 processor:199] process Project1:ce739e1d152d603f2b54669ac2f87737 (url) -> [200] len:190386 -> result:None fol:1 msg:0 err:None
[I 160521 16:05:08 scheduler:712] task done Project1:ce739e1d152d603f2b54669ac2f87737 (url)
[I 160521 16:05:08 tornado_fetcher:410] [200] Project1:723240e17ce580fdb9887af195a8d138 (url) 0.13s
[I 160521 16:05:08 tornado_fetcher:410] [200] Project1:1fb7d805b99ef2ebe8ec81fa6075e966 (url) 0.18s
[I 160521 16:05:10 scheduler:771] select Project1:c33b9c4ffb85bae6d0cddfad4cf04d45 (url)
[I 160521 16:05:12 scheduler:771] select Project1:c0bd35f4b342b67bd853b5a2829dc2ab (url)
[I 160521 16:05:15 scheduler:771] select Project1:2833ef55e713f920e37a42b0f08b23f9 (url)
[I 160521 16:05:17 scheduler:771] select Project1:6307765509b193505ec0816651248672 (url)
[I 160521 16:05:20 scheduler:771] select Project1:6ff749186261cbca0144befd59acb95a (url)
[I 160521 16:05:22 scheduler:771] select Project1:8b67d4061112ae278e9b6d75218a264a (url)
[I 160521 16:05:25 scheduler:771] select Project1:3bb62916cfc13e690cb94c3b6a9b76ce (url)
[I 160521 16:05:27 scheduler:771] select Project1:8706cb5ebb3610a3bf9c8429656479c7 (url)
[I 160521 16:05:30 scheduler:771] select Project1:6db41cea81286d943c0e61fa5abbb007 (url)
[I 160521 16:05:32 scheduler:771] select Project1:ffe8d447582c5096fd3f754fc592ffc3 (url)
[I 160521 16:05:35 scheduler:771] select Project1:86e9e5dc681ced9e403f5139feb21ee1 (url)
[I 160521 16:05:37 scheduler:771] select Project1:10b70689ecc36aa9417323a7ba36037f (url)
[I 160521 16:05:39 processor:199] process Project1:df0430ce50c18f3da10fc00bb1f71e9e (url) -> [200] len:184714 -> result:None fol:1 msg:0 err:None
[I 160521 16:05:39 scheduler:712] task done Project1:df0430ce50c18f3da10fc00bb1f71e9e (url)
[I 160521 16:05:39 scheduler:628] new task Project1:5d903c2ee90096818830dde199c0f6ec (url)
[I 160521 16:05:40 scheduler:771] select Project1:5d903c2ee90096818830dde199c0f6ec (url)
[I 160521 16:05:41 processor:199] process Project1:ab6fb9e60692e1334ce4025ed641d147 (url) -> [200] len:46400 -> result:None fol:0 msg:0 err:None
[I 160521 16:05:41 scheduler:712] task done Project1:ab6fb9e60692e1334ce4025ed641d147 (url)
[I 160521 16:05:41 tornado_fetcher:410] [200] Project1:5d903c2ee90096818830dde199c0f6ec (url) 0.20s
[I 160521 16:05:41 tornado_fetcher:410] [200] Project1:86e9e5dc681ced9e403f5139feb21ee1 (url) 0.21s
[I 160521 16:05:42 scheduler:771] select Project1:35cf5bee7d417a28bda039c9ba034f6b (url)
[I 160521 16:05:44 processor:199] process Project1:76347f38ef6cfa9f98ea52cc44bb1b1e (url) -> [200] len:46940 -> result:None fol:0 msg:0 err:None
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:6db41cea81286d943c0e61fa5abbb007 (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:ffe8d447582c5096fd3f754fc592ffc3 (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:8b67d4061112ae278e9b6d75218a264a (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:6ff749186261cbca0144befd59acb95a (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:6307765509b193505ec0816651248672 (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:c0bd35f4b342b67bd853b5a2829dc2ab (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:c33b9c4ffb85bae6d0cddfad4cf04d45 (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:3bb62916cfc13e690cb94c3b6a9b76ce (url) 2.50s
[I 160521 16:05:44 tornado_fetcher:410] [200] Project1:8706cb5ebb3610a3bf9c8429656479c7 (url) 2.50s
[I 160521 16:05:44 scheduler:712] task done Project1:76347f38ef6cfa9f98ea52cc44bb1b1e (url)

Answer

Fetch frequency has nothing to do with whether a fetch has completed: even if the previous task is still being fetched, new requests are still dispatched.

The log shows that select strictly follows the 0.4 rate; each select is more than 2 seconds apart.
There was just one stretch where, probably due to network problems, the fetches never completed, so all you saw were selects.
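pyspider's rate/burst throttle behaves like a token bucket. The minimal sketch below (a hypothetical simplified class, not pyspider's actual code) shows why, with rate=0.4 and burst=1.0, selects fire roughly every 2.5 seconds regardless of whether earlier fetches have finished:

```python
import time

class TokenBucket:
    """Token-bucket throttle mirroring the rate/burst idea.
    Hypothetical simplification, not pyspider's actual implementation."""

    def __init__(self, rate=0.4, burst=1.0):
        self.rate = rate        # tokens refilled per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = burst     # start full
        self.last = time.time()

    def _refill(self, now):
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, now=None):
        """Take one token if available; the scheduler only selects a
        task when this succeeds, independent of fetches in flight."""
        if now is None:
            now = time.time()
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

b = TokenBucket(rate=0.4, burst=1.0)
t0 = b.last
print(b.consume(t0))        # True  - bucket starts full
print(b.consume(t0 + 1.0))  # False - only 0.4 tokens refilled so far
print(b.consume(t0 + 2.5))  # True  - 1/0.4 = 2.5 s to refill one token
```

With burst=1.0 at most one token is ever stored, so even after a long idle stretch only one select fires immediately; the roughly 2.5-second spacing between selects in the log matches 1/rate.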

  1. Bans generally count the number of requests within a time window; I have never seen one that checks the interval between two consecutive requests. So within a given period, it makes little difference whether the requests are packed into 1 second or spread over 1 minute.

  2. A request roughly consists of three phases: establishing the connection, send, and receive. For ban statistics, the time spent in send and receive is irrelevant, so rate limiting does not need to account for fetch time. The connection time, however, can indeed cause requests to be sent out in a cluster when the network misbehaves. I will fix that.
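Point 1 can be illustrated with a sliding-window counter, the kind of check anti-crawler logic typically applies (an illustrative sketch, not any particular site's implementation): ten requests packed into one second and ten spread 2.5 seconds apart look identical to a 60-second window.

```python
from collections import deque

def make_window_counter(window=60.0):
    """Count requests seen within the last `window` seconds
    (illustrative sketch of a typical anti-crawler check)."""
    times = deque()
    def count(now):
        times.append(now)
        # drop timestamps that have fallen out of the window
        while times and times[0] <= now - window:
            times.popleft()
        return len(times)
    return count

# Ten requests bunched within one second...
bunched = make_window_counter(60.0)
for i in range(10):
    n_bunched = bunched(i * 0.1)

# ...versus ten requests spaced 2.5 s apart (one per select at rate=0.4).
spread = make_window_counter(60.0)
for i in range(10):
    n_spread = spread(i * 2.5)

print(n_bunched, n_spread)  # 10 10 - the window sees the same load
```

Either pattern yields the same per-window count, so a burst caused by stalled connections completing together should not, by itself, raise the count a window-based ban would see.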

