How to force Scrapy to crawl duplicate URLs?


Question

I am learning Scrapy, a web crawling framework.
By default it does not crawl duplicate URLs, or URLs that Scrapy has already crawled.

How can I make Scrapy crawl duplicate URLs, or URLs that it has already crawled?
I tried to find an answer on the internet but could not find relevant help.

I found DUPEFILTER_CLASS = RFPDupeFilter and SgmlLinkExtractor from Scrapy - Spider crawls duplicate urls, but that question is the opposite of what I am looking for.

Answer

You're probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
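As a minimal sketch (the spider name, start URL, and callback names are hypothetical, only the dont_filter=True argument comes from the answer above), re-requesting an already-crawled URL might look like this:

import scrapy


class RevisitSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = "revisit_example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # dont_filter=True tells the scheduler to skip the duplicate filter
        # for this request, so the same URL is fetched again even though it
        # has already been crawled.
        yield scrapy.Request(
            response.url,
            callback=self.parse_again,
            dont_filter=True,
        )

    def parse_again(self, response):
        # Second visit to the same URL; without dont_filter=True this
        # request would have been dropped as a duplicate.
        self.logger.info("Re-crawled %s", response.url)

Note that dont_filter=True only bypasses the filter for the requests where you set it; other requests are still deduplicated as usual.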

