How to implement a custom dupefilter in Scrapy?

Question

I'm trying to create a custom DUPEFILTER_CLASS in Scrapy, but it doesn't seem to be working. Here is my example spider:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    custom_settings = {
        'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter',
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

where items.py is

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

and dupefilter.py, which sits next to items.py in the directory tree, is

import logging
import redis
import scrapy.dupefilters

class RedisDupeFilter(scrapy.dupefilters.BaseDupeFilter):
    def __init__(self, server, key):
        self.server = server
        self.key = key
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_settings(cls, settings):
        server = redis.Redis()
        key = "URLs_seen"
        return cls(server=server, key=key)

    def request_seen(self, request):
        self.logger.debug("Checking whether request {request} has been seen yet...".format(request=request))
        added = self.server.sadd(self.key, request.url)
        return added == 0

Before running the spider, I started Redis with redis-server at the command line. Then, in the project directory, I started the crawl using

scrapy crawl quotes

and observed the following log output:

2017-05-05 12:45:15 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-05-05 12:45:15 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2017-05-05 12:45:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-05-05 12:45:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-05 12:45:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-05 12:45:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-05 12:45:15 [scrapy.core.engine] INFO: Spider opened
2017-05-05 12:45:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-05 12:45:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-05 12:45:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-05-05 12:45:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2017-05-05 12:45:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Albert Einstein',
 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'],
 'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'J.K. Rowling',
 'tags': [u'abilities', u'choices'],
 'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Albert Einstein',
 'tags': [u'inspirational', u'life', u'live', u'miracle', u'miracles'],
 'text': u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Jane Austen',
 'tags': [u'aliteracy', u'books', u'classic', u'humor'],
 'text': u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Marilyn Monroe',
 'tags': [u'be-yourself', u'inspirational'],
 'text': u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Albert Einstein',
 'tags': [u'adulthood', u'success', u'value'],
 'text': u'\u201cTry not to become a man of success. Rather become a man of value.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Andr\xe9 Gide',
 'tags': [u'life', u'love'],
 'text': u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Thomas A. Edison',
 'tags': [u'edison', u'failure', u'inspirational', u'paraphrased'],
 'text': u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Eleanor Roosevelt',
 'tags': [u'misattributed-eleanor-roosevelt'],
 'text': u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': u'Steve Martin',
 'tags': [u'humor', u'obvious', u'simile'],
 'text': u'\u201cA day without sunshine is like, you know, night.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Marilyn Monroe',
 'tags': [u'friends',
          u'heartbreak',
          u'inspirational',
          u'life',
          u'love',
          u'sisters'],
 'text': u"\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'J.K. Rowling',
 'tags': [u'courage', u'friends'],
 'text': u'\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Albert Einstein',
 'tags': [u'simplicity', u'understand'],
 'text': u"\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Bob Marley',
 'tags': [u'love'],
 'text': u"\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Dr. Seuss',
 'tags': [u'fantasy'],
 'text': u'\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Douglas Adams',
 'tags': [u'life', u'navigation'],
 'text': u'\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Elie Wiesel',
 'tags': [u'activism',
          u'apathy',
          u'hate',
          u'indifference',
          u'inspirational',
          u'love',
          u'opposite',
          u'philosophy'],
 'text': u"\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d"}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Friedrich Nietzsche',
 'tags': [u'friendship',
          u'lack-of-friendship',
          u'lack-of-love',
          u'love',
          u'marriage',
          u'unhappy-marriage'],
 'text': u'\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Mark Twain',
 'tags': [u'books', u'contentment', u'friends', u'friendship', u'life'],
 'text': u'\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d'}
2017-05-05 12:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': u'Allen Saunders',
 'tags': [u'fate',
          u'life',
          u'misattributed-john-lennon',
          u'planning',
          u'plans'],
 'text': u'\u201cLife is what happens to us while we are making other plans.\u201d'}
2017-05-05 12:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-05 12:45:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 675,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5976,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 5, 10, 45, 15, 881028),
 'item_scraped_count': 20,
 'log_count/DEBUG': 24,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 5, 5, 10, 45, 15, 553906)}
2017-05-05 12:45:15 [scrapy.core.engine] INFO: Spider closed (finished)

What puzzles me is that the message "Checking whether request ... has been seen yet...", which I log with self.logger.debug in the request_seen method of RedisDupeFilter, appears nowhere in the logs. In short, I see no confirmation that the dupefilter is actually working. (It probably isn't, because if I run the crawl again I see the same output.)

How can I get my custom DUPEFILTER_CLASS to work?

Recommended answer

Following paul trmbrth's comment, instead of using the start_urls class variable I overrode the start_requests method, as in the Scrapy Tutorial:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter'}

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

This works because the default value of dont_filter in the constructor of a Request object is False (cf. https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects), so these requests are passed to the dupefilter. The requests that Scrapy builds from start_urls in its default start_requests implementation are created with dont_filter=True, which is why request_seen was never called in the original spider.
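To make the role of dont_filter concrete, here is a minimal sketch of the check the scheduler applies before enqueuing a request: the dupefilter is consulted only when dont_filter is False. It does not import Scrapy; FakeRequest, RecordingDupeFilter and enqueue are hypothetical stand-ins written for this illustration, not Scrapy's own classes.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, reduced to two fields."""
    url: str
    dont_filter: bool = False  # same default as scrapy.Request


class RecordingDupeFilter:
    """In-memory dupefilter that records every URL it is asked about."""

    def __init__(self):
        self.seen = set()
        self.checked = []

    def request_seen(self, request):
        self.checked.append(request.url)
        if request.url in self.seen:
            return True
        self.seen.add(request.url)
        return False


def enqueue(request, dupefilter, queue):
    """Mimic the scheduler rule: filter only when dont_filter is False."""
    if not request.dont_filter and dupefilter.request_seen(request):
        return False  # duplicate, dropped
    queue.append(request)
    return True


url = 'http://quotes.toscrape.com/page/1/'

# Case A: requests flagged dont_filter=True, as with the default start_urls handling.
df_a, queue_a = RecordingDupeFilter(), []
results_a = [enqueue(FakeRequest(url, dont_filter=True), df_a, queue_a)
             for _ in range(2)]

# Case B: requests created with the default dont_filter=False.
df_b, queue_b = RecordingDupeFilter(), []
results_b = [enqueue(FakeRequest(url), df_b, queue_b) for _ in range(2)]

print(results_a, len(df_a.checked))  # both enqueued, filter never consulted
print(results_b, len(df_b.checked))  # second one dropped, filter consulted twice
```

In case A the filter's request_seen is never reached, which matches the missing debug message in the question's log.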

The dupefilter now works (given that a Redis server is running in the background): the second time I run scrapy crawl quotes, nothing gets scraped.
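The reason the second run scrapes nothing is that the set stored under "URLs_seen" lives in Redis and outlives the crawl process. The sketch below imitates this with an in-memory stand-in; FakeRedisSet is a hypothetical class that only mimics the SADD return value (1 for a newly added member, 0 for an existing one) and is not part of redis-py.

```python
class FakeRedisSet:
    """Hypothetical in-memory stand-in for a Redis server holding sets."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Like SADD with one member: 1 if it was added, 0 if already present.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def request_seen(server, key, url):
    """Same test as RedisDupeFilter.request_seen in the question."""
    return server.sadd(key, url) == 0


# The server object is created once and shared by both "runs",
# just as a real Redis server keeps its data between crawls.
server = FakeRedisSet()
urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

first_run = [request_seen(server, "URLs_seen", u) for u in urls]
second_run = [request_seen(server, "URLs_seen", u) for u in urls]

print(first_run)   # no URL seen yet, so nothing is filtered
print(second_run)  # every URL already stored, so everything is filtered
```

To scrape the same pages again with the real filter, the "URLs_seen" key would have to be removed from Redis first, or the filter given a different key per run.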
