Scrapy not crawling subsequent pages in order


Question

I am writing a crawler to get the names of items from a website. The site has 25 items per page and multiple pages (200 pages for some item types).

The code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]
    def start_requests(self):
        for i in xrange(8):
            yield self.make_requests_from_url("http://www.lonelyplanet.com/europe/sights?page=%d" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h2')
        items = []
        for site in sites:
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            items.append(item)
        return items
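
For reference, the CSV export mentioned below is typically produced with Scrapy's crawl command and feed exports; the exact command is not in the original post, but it would look something like:

scrapy crawl lonelyplanetItemName_spider -o items.csv -t csv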

When I run the crawler and store the data in CSV format, the data is not stored in order: page 2 data is stored before page 1, or page 3 before page 2, and so on. Also, sometimes before all the data of a particular page has been stored, data from another page comes in, and then the rest of the former page's data is stored.

Answer

scrapy is an asynchronous framework. It uses non-blocking IO, so it doesn't wait for a request to finish before starting the next one.

And since multiple requests can be in flight at a time, it is impossible to know the exact order in which the parse() method will receive the responses.
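
For illustration, the degree of parallelism is controlled by Scrapy's concurrency settings; the values below are Scrapy's documented defaults (they are not part of the original answer), and with them many pages are downloaded at once, which is why responses reach parse() in an unpredictable order:

CONCURRENT_REQUESTS = 16            # settings.py: total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit per domain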

My point is, scrapy is not meant to extract data in a particular order. If you absolutely need to preserve order, there are some ideas here: Scrapy Crawl URLs in Order
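
As a minimal sketch of one such idea, using the same old-style API as the question: give every request a priority and carry its page number in the request meta, then sort the exported CSV by that column afterwards (priority influences scheduling but does not guarantee strict ordering). The priority values, the 'page' meta key, and the extra page field on the item are assumptions for illustration, not something from the original answer.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    def start_requests(self):
        for i in xrange(8):
            url = "http://www.lonelyplanet.com/europe/sights?page=%d" % i
            # Higher-priority requests are scheduled first, so earlier pages
            # tend to be fetched earlier; meta carries the page index along.
            yield Request(url, callback=self.parse, meta={'page': i}, priority=8 - i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//h2'):
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            item['page'] = response.meta['page']  # assumes a 'page' Field on LonelyplanetItem
            yield item

Sorting the resulting CSV by the page column then restores page order no matter how the responses arrived.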
