How to filter duplicate requests based on url in Scrapy


Question


I am writing a crawler for a website using Scrapy with CrawlSpider.

Scrapy provides a built-in duplicate-request filter which filters duplicate requests based on urls. I can also filter requests using the rules member of CrawlSpider.
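For reference, a minimal sketch of the kind of spider I mean (the spider name, domain and link patterns here are placeholders, not my real ones):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    # Illustrative only: name, domain and patterns are made up.
    name = "example"
    allowed_domains = ["www.abc.com"]
    start_urls = ["http://www.abc.com/"]

    rules = (
        # LinkExtractor patterns decide which links are followed, but they
        # match whole urls, so they cannot express "same id, different refer"
        # on their own.
        Rule(LinkExtractor(allow=(r"/p/.+\.html",)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        pass  # item extraction goes here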

What I want to do is to filter requests like:

http://www.abc.com/p/xyz.html?id=1234&refer=5678

If I have already visited

http://www.abc.com/p/xyz.html?id=1234&refer=4567

NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.

Now, if I keep a set which accumulates all ids, I could ignore duplicates in my callback function parse_item to achieve this functionality.
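A rough sketch of that callback-level approach (the spider skeleton and the id extraction are just for illustration):

from urllib.parse import urlparse, parse_qs

from scrapy.spiders import CrawlSpider

class DedupInCallbackSpider(CrawlSpider):
    # Hypothetical skeleton: only the dedup logic in the callback matters here.
    name = "dedup_in_callback"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_ids = set()

    def parse_item(self, response):
        # Take only the id from the query string; refer is ignored.
        item_id = parse_qs(urlparse(response.url).query).get("id", [None])[0]
        if item_id in self.seen_ids:
            return  # duplicate id -- but the page has already been downloaded
        self.seen_ids.add(item_id)
        # ... extract and yield the item here ...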

But that would mean I am still fetching that page even though I don't need to.

So how can I tell Scrapy that it shouldn't send a particular request, based on the url?

Solution

You can write a custom dupe filter class for the duplicate removal and enable it in your settings:

import os

from scrapy.dupefilter import RFPDupeFilter  # in newer Scrapy versions the module is scrapy.dupefilters

class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        # Cut the url off at "&refer" so urls that differ only in the
        # refer parameter produce the same fingerprint.
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True  # already seen: Scrapy drops the request
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

Then you need to set the correct DUPEFILTER_CLASS in settings.py:

DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'

It should work after that.
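If splitting on "&refer" feels too fragile (for example if refer can appear before id in the query string), a more defensive variant is to rebuild the url without the refer parameter. A small sketch, assuming refer is the only parameter to ignore:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def url_without_refer(url):
    # Drop the refer parameter so urls differing only in refer produce
    # the same fingerprint, regardless of parameter order.
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "refer"]
    return urlunparse(parts._replace(query=urlencode(query)))

# e.g. url_without_refer("http://www.abc.com/p/xyz.html?id=1234&refer=5678")
# returns "http://www.abc.com/p/xyz.html?id=1234"

You could call a helper like this from __getid in the filter above instead of slicing the string.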
