Scrapy stops crawling after a few pages

Views: 43 Published: 2021/7/16 21:52:48 Tags: python web-scraping web-crawler scrapy

Question

I'm just picking up the basics of Scrapy and website crawlers so I would really appreciate your input. I've built a plain and simple crawler from Scrapy, guided by a tutorial.

It works fine but it won't crawl all the pages as it should.

My spider code is:

from scrapy.spider       import BaseSpider
from scrapy.selector     import HtmlXPathSelector
from scrapy.http.request import Request
from fraist.items        import FraistItem
import re

class fraistspider(BaseSpider):
    name = "fraistspider"
    allowed_domain = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@class='pagination']/a/@href").extract()

        #We stored already crawled links in this list
        crawledLinks    = []

        #Pattern to check proper link
        linkPattern     = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if linkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append(link)
                yield Request(link, self.parse)

        posts = hxs.select("//article[@class='content-summary']")
        items = []
        for post in posts:
            item = FraistItem()
            item["title"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.select("div[@class='summary']/p/text()").extract()
            items.append(item)
        for item in items:
            yield item

The output is:

         'title': [u'Design a poster in the style of Saul Bass']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Scraped from <200 http://nnbdesig
ner.wpengine.com/designer-blog/>
        {'content': [u'Helping a company come up with a branding strategy can be
 exciting\xa0and intimidating, all at once. It gives a designer the opportunity
to make a great visual impact with a brand, but requires skills in logo, print a
nd digital design. If you\u2019ve been hesitating to join a 99designs Brand Iden
tity Pack contest, here are a... '],
         'link': [u'http://99designs.com/designer-blog/2015/05/07/tips-brand-ide
ntity-pack-design-success/'],
         'title': [u'99designs\u2019 tips for a successful Brand Identity Pack d
esign']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://
nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/10/
>
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://
nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/11/
>
2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)
2015-05-20 16:22:41+0100 [fraistspider] INFO: Stored csv feed (100 items) in: da
ta.csv
2015-05-20 16:22:41+0100 [fraistspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 4425,
         'downloader/request_count': 16,
         'downloader/request_method_count/GET': 16,
         'downloader/response_bytes': 126915,
         'downloader/response_count': 16,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 5,
         'dupefilter/filtered': 41,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 20, 15, 22, 41, 738000),
         'item_scraped_count': 100,
         'log_count/DEBUG': 119,
         'log_count/INFO': 8,
         'request_depth_max': 5,
         'response_received_count': 11,
         'scheduler/dequeued': 16,
         'scheduler/dequeued/memory': 16,
         'scheduler/enqueued': 16,
         'scheduler/enqueued/memory': 16,
         'start_time': datetime.datetime(2015, 5, 20, 15, 22, 40, 718000)}
2015-05-20 16:22:41+0100 [fraistspider] INFO: Spider closed (finished)

As you can see the 'item_scraped_count' is 100 although it should be much more since there are 122 pages in total, 10 articles per page.

From the output I can see that there is a 301 redirect issue, but I don't understand why this is causing problems. I've tried another approach, rewriting my spider's code, but again it breaks after a few entries, around the same part.
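
To see exactly which requests were redirected, the relevant lines can be pulled out of the log with a short script. This is only a diagnostic sketch: the two DEBUG lines are taken from the output above with the console line-wrapping removed, and the pattern and the `find_redirects` helper are hypothetical names that assume the "Redirecting (301) to <GET ...> from <GET ...>" wording shown in that log.

```python
import re

# Log lines copied from the output above, with the console line-wrapping removed.
log_lines = [
    "2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to "
    "<GET http://nnbdesigner.wpengine.com/> from "
    "<GET http://99designs.com/designer-blog/page/10/>",
    "2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to "
    "<GET http://nnbdesigner.wpengine.com/> from "
    "<GET http://99designs.com/designer-blog/page/11/>",
    "2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)",
]

# Assumed log wording: "Redirecting (<status>) to <GET target> from <GET source>".
REDIRECT = re.compile(r"Redirecting \((\d{3})\) to <GET ([^>\s]+)> from <GET ([^>\s]+)>")


def find_redirects(lines):
    """Return a (status, source_url, target_url) tuple for each redirect line."""
    found = []
    for line in lines:
        match = REDIRECT.search(line)
        if match:
            status, target, source = match.groups()
            found.append((int(status), source, target))
    return found


redirects = find_redirects(log_lines)
for status, source, target in redirects:
    print(status, source, "->", target)
```

Both paginated URLs are redirected to the same site root rather than to a listing page, which suggests the crawl ends there because no page 10 or page 11 listing is ever fetched.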

Any help would be much appreciated. Thank you!

Recommended answer

For this case I'd go with a CrawlSpider to crawl multiple pages, so you have to define a rule that matches the pages in 99designs.com and slightly modify your parse function to process the item.

Example code adapted from the Scrapy docs:

import scrapy
# Import paths for Scrapy 1.0 and later; older releases used
# scrapy.contrib.spiders and scrapy.contrib.linkextractors.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # A plain dict is used because a bare scrapy.Item() declares no fields,
        # so assigning item['id'] on it would raise a KeyError.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

I just found this blog post that contains a useful example.
