how to filter duplicate requests based on url in scrapy
I am writing a crawler for a website using scrapy with CrawlSpider.
Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on urls. Also, I can filter requests using rules member of CrawlSpider.
What I want to do is to filter requests like:
http://www.abc.com/p/xyz.html?id=1234&refer=5678
If I have already visited
http://www.abc.com/p/xyz.html?id=1234&refer=4567
NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.
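For illustration (this sketch is not from the original post), the intended canonicalization can be expressed with Python's urllib.parse: drop the refer query parameter, and URLs that differ only in it compare equal.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_refer(url):
    """Return the url with the 'refer' query parameter removed."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "refer"]
    return urlunparse(parts._replace(query=urlencode(query)))

# Both example URLs collapse to the same canonical form:
a = strip_refer("http://www.abc.com/p/xyz.html?id=1234&refer=5678")
b = strip_refer("http://www.abc.com/p/xyz.html?id=1234&refer=4567")
print(a == b)  # True
```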
Now, if I have a set that accumulates all ids, I could ignore duplicates in parse_item (my callback function) to achieve this functionality.
But that would mean I am still at least fetching that page, when I don't need to.
So what is the way in which I can tell scrapy that it shouldn't send a particular request based on the url?
You can write a custom dupe filter for duplicate removal and add it in settings
import os
from scrapy.dupefilters import RFPDupeFilter  # 'scrapy.dupefilter' (no 's') in older Scrapy versions

class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        # Keep only the part of the url before the 'refer' parameter
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
Then you need to set the correct DUPEFILTER_CLASS in settings.py
DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'
It should work after that.
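The fingerprinting logic above can be exercised on its own with a plain set, mirroring what request_seen does (a standalone sketch, no Scrapy required):

```python
seen = set()

def getid(url):
    # Same naive split as in CustomFilter.__getid
    return url.split("&refer")[0]

def request_seen(url):
    fp = getid(url)
    if fp in seen:
        return True   # duplicate: the scheduler would drop this request
    seen.add(fp)
    return False

print(request_seen("http://www.abc.com/p/xyz.html?id=1234&refer=5678"))  # False (first visit)
print(request_seen("http://www.abc.com/p/xyz.html?id=1234&refer=4567"))  # True (treated as duplicate)
```

Note that the "&refer" split only works when refer is the last parameter in the query string; parsing the query string properly (e.g. with urllib.parse) is safer in general.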