Scrapy LinkExtractor - Limit the number of pages crawled per URL

Problem description

I am trying to limit the number of pages crawled per URL in a CrawlSpider in Scrapy. I have a list of start_urls and I want to set a limit on the number of pages crawled from each URL. Once the limit is reached, the spider should move on to the next start_url.
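
For illustration only (the URLs and the limit N below are hypothetical, not taken from the original question), the desired behaviour is roughly:

# Hypothetical start URLs -- the real list is not shown in the question.
start_urls = [
    'http://site-a.example.com',
    'http://site-b.example.com',
]
# Desired behaviour: fetch at most N pages reachable from site-a.example.com,
# then stop following its links and continue with site-b.example.com.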

I know there is the DEPTH_LIMIT setting, but this is not what I am looking for.
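
For reference, DEPTH_LIMIT is set in settings.py and caps how many links deep the crawl may go from each start URL; it does not cap the number of pages fetched per URL, which is why it does not fit here. A minimal sketch (the value is arbitrary):

# settings.py
# DEPTH_LIMIT bounds link depth from each start URL, not pages per URL.
DEPTH_LIMIT = 3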

Any help would be appreciated.

Here is the code I currently have:

class MySpider(CrawlSpider):
    name = 'test'
    allowed_domains = domainvarwebsite
    start_urls = httpvarwebsite

    rules = [Rule(LinkExtractor(),
             callback='parse_item',
             follow=True)
            ]

    def parse_item(self, response):
        # Here I parse and yield the items I am interested in.
        pass

EDIT

I have tried to implement this, but I get exceptions.SyntaxError: invalid syntax (filter_domain.py, line 20). Any ideas what is going on?

Thanks again.

filter_domain.py

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest

class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        self.domains_to_filter = domains_to_filter
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        spider_name = crawler.spider.name
        max_to_filter = settings.get('MAX_TO_FILTER')
        o = cls(max_to_filter)
        return o

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        # Line 20 of the file: note the unmatched closing ")" at the end of
        # the next line -- that is what raises "invalid syntax".
        if self.counter.get(parsed_url.netloc, 0) < self.max_to_filter[parsed_url.netloc]):
            self.counter[parsed_url.netloc] += 1
        else:
            raise IgnoreRequest()

settings.py

MAX_TO_FILTER = 30

DOWNLOADER_MIDDLEWARES = {
    'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
}

Recommended answer

Scrapy doesn't offer this directly, but you could create a custom Middleware, something like this:

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest

class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        # Mapping of domain (netloc) -> maximum number of requests to allow.
        self.domains_to_filter = domains_to_filter
        # Per-domain count of requests seen so far.
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        spider_name = crawler.spider.name
        domains_to_filter = settings.get('DOMAINS_TO_FILTER')
        o = cls(domains_to_filter)
        return o

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        if parsed_url.netloc in self.domains_to_filter:
            if self.counter.get(parsed_url.netloc, 0) < self.domains_to_filter[parsed_url.netloc]:
                self.counter[parsed_url.netloc] += 1
            else:
                # Limit reached for this domain: drop the request.
                raise IgnoreRequest()
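
Note that the snippet above uses the Python 2 urlparse module, as in the original answer. Under Python 3 the same function lives in urllib.parse; a minimal adjustment would be:

# Python 3 replacement for "import urlparse"
from urllib.parse import urlparse

# and inside process_request:
#     parsed_url = urlparse(request.url)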

and declare DOMAINS_TO_FILTER in the settings, like:

DOMAINS_TO_FILTER = {
    'mydomain': 5
}

to only accept 5 requests from that domain (the keys are matched against urlparse(request.url).netloc, so in practice they would be full host names such as 'www.example.com'). Also remember to enable the middleware in the settings, as specified here.
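
For example, reusing the module path from the question (myproject/filter_domain.py, which may differ in your own project layout), the registration would look something like:

# settings.py -- enable the downloader middleware; the path
# 'myproject.filter_domain' is taken from the question and may differ.
DOWNLOADER_MIDDLEWARES = {
    'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
}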
