How does Scrapy filter crawled URLs?

Question

I want to know how Scrapy filters crawled URLs. Does it store all crawled URLs in something like a crawled_urls_list, and when it gets a new URL, does it look up that list to check whether the URL already exists?

Where is the code for this filtering part of CrawlSpider (/path/to/scrapy/contrib/spiders/crawl.py)?

Thanks a lot!

Answer

By default, Scrapy keeps a fingerprint of every request it has seen. These fingerprints are held in memory in a Python set and, when the JOBDIR setting is defined, are also appended to a file called requests.seen in that directory. If you restart Scrapy, the file is reloaded into the Python set. The class that controls this is RFPDupeFilter in the scrapy.dupefilter module; you can subclass it if you need different behaviour.
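
A quick way to see the fingerprinting described above is the sketch below. It assumes an older Scrapy release where scrapy.utils.request.request_fingerprint is still available (newer releases replaced it with a pluggable request fingerprinter):

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    # Two Request objects for the same URL produce the same fingerprint,
    # so the scheduler treats the second one as a duplicate and drops it.
    r1 = Request("http://example.com/page?id=1")
    r2 = Request("http://example.com/page?id=1")
    print(request_fingerprint(r1) == request_fingerprint(r2))  # True

If you need different behaviour, Scrapy lets you point the DUPEFILTER_CLASS setting at your own filter class. The following is a minimal sketch, not the library's own implementation: URLDupeFilter is a hypothetical name, it deduplicates on the exact URL string instead of the request fingerprint, and the import path assumes a newer Scrapy where the module is scrapy.dupefilters (older versions used scrapy.dupefilter):

    from scrapy.dupefilters import RFPDupeFilter

    class URLDupeFilter(RFPDupeFilter):
        """Hypothetical filter that deduplicates on the exact URL string."""

        def __init__(self, path=None, debug=False):
            super().__init__(path, debug)
            self.seen_urls = set()  # URLs already scheduled in this run

        def request_seen(self, request):
            # Returning True tells the scheduler to drop the request.
            if request.url in self.seen_urls:
                return True
            self.seen_urls.add(request.url)
            return False

To enable it, set DUPEFILTER_CLASS = 'myproject.dupefilters.URLDupeFilter' in settings.py, adjusting the dotted path to wherever the class lives in your project.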
