How to force Scrapy to crawl duplicate URLs?
Question
I am learning Scrapy, a web crawling framework.
By default, it does not crawl duplicate URLs or URLs that Scrapy has already crawled.
How can I make Scrapy crawl duplicate URLs, or URLs it has already crawled?
I tried to find an answer on the internet but could not find relevant help.
I found DUPEFILTER_CLASS = RFPDupeFilter and SgmlLinkExtractor in Scrapy - Spider crawls duplicate urls, but that question asks the opposite of what I am looking for.
Answer
You're probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
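To see why the flag works, note that Scrapy's default dupe filter (RFPDupeFilter) keeps a set of fingerprints for requests it has already seen and drops any new request whose fingerprint is in that set, unless the request was built with dont_filter=True. The following is a minimal toy model of that behavior, not Scrapy's actual implementation; the names SimpleDupeFilter and Request here are illustrative stand-ins:

```python
# Toy model of Scrapy-style duplicate filtering.
# SimpleDupeFilter and this Request class are illustrative only.
from dataclasses import dataclass


@dataclass
class Request:
    url: str
    dont_filter: bool = False  # same meaning as scrapy.Request's flag


class SimpleDupeFilter:
    def __init__(self):
        self.seen = set()  # fingerprints (here: plain URLs) already scheduled

    def should_schedule(self, request):
        # dont_filter=True bypasses the duplicate check entirely
        if request.dont_filter:
            return True
        if request.url in self.seen:
            return False  # duplicate: dropped
        self.seen.add(request.url)
        return True


f = SimpleDupeFilter()
print(f.should_schedule(Request("http://example.com")))                    # True
print(f.should_schedule(Request("http://example.com")))                    # False (duplicate)
print(f.should_schedule(Request("http://example.com", dont_filter=True)))  # True (filter bypassed)
```

In a real spider, you would simply pass the flag when building the request, e.g. yield scrapy.Request(url, callback=self.parse, dont_filter=True).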