Sequential scraping from multiple start_urls leading to error in parsing


Problem Description

First, highest appreciation for all of your work answering noob questions like this one.

Second, as this seems to be quite a common problem, I found (IMO) related questions such as: Scrapy: Wait for a specific url to be parsed before parsing others

However, at my current state of understanding it is not straightforward to adapt the suggestions to my specific case, and I would really appreciate your help.

Problem outline: running on Python 3.7.1, Scrapy 1.5.1

I want to scrape data from every link collected on pages like this one: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1

and then from all links on another collection page:

https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650

I manage to get the desired information (only two elements are shown here) if I run the spider for one page (e.g. page 1 or 650) at a time. (Note that I restricted the number of links crawled per page to 2.) However, once I have multiple start_urls (setting two elements in the list [1,650] in the code below), the parsed data is no longer consistent: apparently at least one element is not found by the xpath. I suspect some (or a lot of) incorrect logic in how I handle/pass the requests, which leads to an unintended parsing order.

Code:

import scrapy
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }    
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            print('#### START REQUESTS: ',url)
            yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)


    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

Command prompt output:

2019-03-18 15:42:27 [scrapy.core.engine] INFO: Spider opened
2019-03-18 15:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-18 15:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
2019-03-18 15:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1> (referer: None)
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
##### PARSING:  /gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort
##### PARSING:  /gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn
2019-03-18 15:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650> (referer: None)
##### PARSING:  /gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue
##### PARSING:  /gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule

2019-03-18 15:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Blinnenhorn/Corno Cieco
Comments:  Am Samstag Aufstieg zur Corno Gries Hütte, ca. 2,5h ab All Acqua. Zustieg problemslos auf guter Spur. Zur Verwunderung waren wir die einzigsten auf der Hütte. Danke an Monika für die herzliche Bewirtung...
Length of dict extracted from Route: 27

2019-03-18 15:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Cima Portule
Comments:  Sehr viel Schnee in dieser Gegend und viel Spirarbeit geleiset, deshalb auch viel Zeit gebraucht.
Length of dict extracted from Route: 19

2019-03-18 15:42:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Schwändeliflue
Comments:  Wege und Pfade meist schneefrei, da im Gebiet viel Hochmoor ist, z.t. sumpfig.  Oberhalb 1600m und in Schattenlagen bis 1400m etwas Schnee  (max.Schuhtief).  Wetter sonnig und sehr warm für die Jahreszeit, T-Shirt - Wetter,  Frühlingshaft....
Length of dict extracted from Route: 17

2019-03-18 15:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Beaufort
2019-03-18 15:42:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
Traceback (most recent call last):
  File "C:\Users\Lenovo\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Lenovo\Dropbox\Code\avalanche\scrapy\slf1\slf1\spiders\slf_spider1.py", line 38, in parse_gipfelbuch_item
    print('Comments: ', fields['Verhältnis-Beschreibung'])
KeyError: 'Verhältnis-Beschreibung'
2019-03-18 15:42:40 [scrapy.core.engine] INFO: Closing spider (finished)

Question: How do I have to structure the first (for the links) and the second (for the content) parsing callback correctly? And why is the "#### PARSER OUTPUT" not in the order I would expect (first page 1 with its links from top to bottom, then the second start page, 650, with its links from top to bottom)?

I already tried reducing CONCURRENT_REQUESTS to 1 and setting DOWNLOAD_DELAY = 2.
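
For completeness, a minimal sketch of how those two settings can be applied per spider (both are standard Scrapy setting names; whether they live in custom_settings as below or in the project's settings.py is an implementation detail, and the values are simply the ones mentioned above):

class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    # Limit Scrapy to a single request in flight and pause between downloads.
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,   # seconds between consecutive requests to the same site
    }
    # ... rest of the spider as shown above ...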

I hope the question is clear enough... big thanks in advance.

Answer

If the problem is that several URLs are visited at the same time, you can visit them one by one using the spider_idle signal (https://docs.scrapy.org/en/latest/topics/signals.html).

The idea is the following:

1. start_requests only visits the first URL

2. when the spider gets idle, the spider_idle method is called

3. the spider_idle method deletes the first URL and visits the second one

4. and so on...

The code would be something like this (I didn't try it):

import scrapy
from scrapy import Request, signals
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }   
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SlfSpider1Spider, cls).from_crawler(crawler, *args, **kwargs)
        # Here you set which method the spider has to run when it gets idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    # Method which starts the crawl by visiting only the first URL in start_urls
    def start_requests(self):
        # the spider visits only the first provided URL
        url = self.start_urls[0]
        print('#### START REQUESTS: ',url)
        yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)


    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

    # When the spider gets idle, it deletes the first url and visits the second, and so on...
    def spider_idle(self, spider):
        del(self.start_urls[0])
        if len(self.start_urls)>0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)
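
As a usage note (my addition, not part of the answer above): because the follow-up request is scheduled from inside the spider_idle handler, the handler can also raise DontCloseSpider while start URLs remain, so the engine does not close the spider in the same idle cycle. A sketch under that assumption:

from scrapy.exceptions import DontCloseSpider

class SlfSpider1Spider(CrawlSpider):
    # ... same spider as above, only spider_idle is changed ...

    def spider_idle(self, spider):
        del self.start_urls[0]
        if len(self.start_urls) > 0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(
                Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)
            # Prevent the spider from being closed while the request scheduled
            # above is still waiting to be processed.
            raise DontCloseSpider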
