Scrapy not crawling subsequent pages in order


Question

I am writing a crawler to get the names of items from a website. The site has 25 items per page and multiple pages (200 pages for some item types).

The code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]
    def start_requests(self):
        for i in xrange(8):
            yield self.make_requests_from_url("http://www.lonelyplanet.com/europe/sights?page=%d" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h2')
        items = []
        for site in sites:
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            items.append(item)
        return items
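
For reference, the CSV export mentioned below is typically produced with Scrapy's crawl command and feed exports; the exact command is not in the original post, but it would look something like:

scrapy crawl lonelyplanetItemName_spider -o items.csv -t csv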

When I run the crawler and store the data in CSV format, the data is not stored in order: page 2 data is stored before page 1, or page 3 before page 2, and so on. Also, sometimes before all the data of a particular page has been stored, data from another page comes in, and then the rest of the former page's data is stored.

Answer

scrapy is an asynchronous framework. It uses non-blocking IO, so it doesn't wait for a request to finish before starting the next one.

And since multiple requests can be in flight at a time, it is impossible to know the exact order in which the parse() method will receive the responses.
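
For illustration, the degree of parallelism is controlled by Scrapy's concurrency settings; the values below are Scrapy's documented defaults (they are not part of the original answer), and with them many pages are downloaded at once, which is why responses reach parse() in an unpredictable order:

CONCURRENT_REQUESTS = 16            # settings.py: total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit per domain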

My point is, scrapy is not meant to extract data in a particular order. If you absolutely need to preserve order, there are some ideas here: Scrapy Crawl URLs in Order
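
As a minimal sketch of one such idea, using the same old-style API as the question: give every request a priority and carry its page number in the request meta, then sort the exported CSV by that column afterwards (priority influences scheduling but does not guarantee strict ordering). The priority values, the 'page' meta key, and the extra page field on the item are assumptions for illustration, not something from the original answer.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    def start_requests(self):
        for i in xrange(8):
            url = "http://www.lonelyplanet.com/europe/sights?page=%d" % i
            # Higher-priority requests are scheduled first, so earlier pages
            # tend to be fetched earlier; meta carries the page index along.
            yield Request(url, callback=self.parse, meta={'page': i}, priority=8 - i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//h2'):
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            item['page'] = response.meta['page']  # assumes a 'page' Field on LonelyplanetItem
            yield item

Sorting the resulting CSV by the page column then restores page order no matter how the responses arrived.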
