Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should


Problem Description

I use Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works. For the command scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1 there is correctly one request, but for a page count of two, scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2, the requests go on without end.

So please explain to me how it works with a simple example.

Here is my spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# PathsSpiderItem comes from the project's items module (module name assumed here)
from myproject.items import PathsSpiderItem


class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    start_urls = ["http://www.example.org/"]

    rules = (
        # Follow every link inside the allowed domain and parse it with parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print '<<<', response.url
        items = []

        # Emit one item per unique in-domain link found on the page
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items

It does not even work for this simple spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Follow every link inside the allowed domain and parse it
        # with the spider's parse_item method.
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()

        return item

but it does not go on forever:

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1   'downloader/request_count': 1,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2   'downloader/request_count': 17,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3   'downloader/request_count': 19,

Maybe it is because of parallel downloading. Yes, with CONCURRENT_REQUESTS = 1 the CLOSESPIDER_PAGECOUNT setting works for the second example. I checked the first one too; it works as well. It was almost endless for me because a sitemap with many URLs (my items) was being crawled as the next page :)
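For reference, here is a minimal sketch, based on the second spider above, of pinning both settings on the spider itself via custom_settings so the page limit behaves deterministically without extra command-line flags; the limit values here are only an example:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    # With only one request in flight at a time, no extra requests are
    # already downloading when the CloseSpider extension hits its limit.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 2,
        'CONCURRENT_REQUESTS': 1,
    }

    rules = (
        Rule(LinkExtractor(allow_domains='www.karen.pl'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)

The same effect can be had from the command line with scrapy crawl example -s CLOSESPIDER_PAGECOUNT=2 -s CONCURRENT_REQUESTS=1.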

Answer

CLOSESPIDER_PAGECOUNT is controlled by the CloseSpider extension, which counts every response until it reaches its limit; at that point it tells the crawler process to start shutting down (finishing in-flight requests and closing the available slots).

Now, the reason your spider ends when you specify CLOSESPIDER_PAGECOUNT=1 is that at that moment (when it gets its first response) there are no pending requests; they are only created after the first one, so the crawler process is ready to end and does not take the following requests into account (because they are born after the first response).

When you specify CLOSESPIDER_PAGECOUNT>1, your spider is caught in the middle of creating requests and filling the request queue. By the time the spider knows it should finish, there are still pending requests to process, and these are executed as part of closing the spider.
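To make that counting behaviour concrete, below is an illustrative sketch of an extension that mimics what CloseSpider does for CLOSESPIDER_PAGECOUNT (it is not the actual extension source): it counts response_received signals and asks the engine to close the spider once the limit is reached.

from scrapy import signals
from scrapy.exceptions import NotConfigured


class PageCountCloser(object):
    """Illustrative sketch of the CLOSESPIDER_PAGECOUNT behaviour."""

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.counter = 0
        # Count every response the downloader hands back.
        crawler.signals.connect(self.response_received,
                                signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint('CLOSESPIDER_PAGECOUNT')
        if not limit:
            raise NotConfigured
        return cls(crawler, limit)

    def response_received(self, response, request, spider):
        self.counter += 1
        if self.counter >= self.limit:
            # Ask the engine to stop. Requests that are already scheduled or
            # downloading may still be processed while the spider shuts down,
            # which is why downloader/request_count can exceed the page count.
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')

The real extension ships with Scrapy and is enabled automatically whenever CLOSESPIDER_PAGECOUNT is set to a non-zero value, so there is nothing to install; the sketch above only shows why already-scheduled requests can still complete after the limit is hit.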
