Scrapy needs to crawl all the next links on the website and move on to the next page


Question

I need my Scrapy spider to move on to the next page. Please give me the correct rule for this. How should I write it?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name ="delh"
    allowed_domains =["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls=["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )
    def parse_gen(self,response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
spider=criticspider()

Answer

From my understanding you are trying to scrape two sorts of pages, hence you should use two distinct rules:

  • paginated list pages, containing links to n item pages and to subsequent list pages
  • item pages, from which you scrape your items

Your rules should then look something like this:

rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]/@href')),
)

Explanation:

  • The first rule matches item links and uses your item parsing method (parse_gen) as its callback. The resulting responses do not go through these rules again.
  • The second rule matches the "pagelinks" and does not specify a callback; the resulting responses are then handled by these rules (a standalone check of this pagination extractor is shown below).
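
To see which links that second rule picks up, here is a small standalone check using LinkExtractor against a made-up snippet of the pagination block. The HTML is an assumed illustration, not the site's actual markup, so adjust the XPath if the real "pagelinks" structure differs.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Toy HTML standing in for the "pagelinks" block (assumed structure,
# not copied from the real site).
body = b"""
<div class="pagelinks">
  <a href="/?search=delhivery&amp;page=1">1</a>
  <a href="/?search=delhivery&amp;page=2">2</a>
  <a href="/?search=delhivery&amp;page=2">Next</a>
</div>
"""
response = HtmlResponse(
    url="http://www.consumercomplaints.in/?search=delhivery",
    body=body, encoding="utf-8")

# Restricting extraction to the "Next" anchor means each list page is
# reached through exactly one link, as the answer recommends.
extractor = LinkExtractor(
    restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')
for link in extractor.extract_links(response):
    print(link.url)  # e.g. http://www.consumercomplaints.in/?search=delhivery&page=2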

Notes:

  • SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source).
  • The order in which you send out your requests does matter and, in this sort of situation (scraping an unknown, potentially large number of pages/items), you should seek to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways (the full spider sketched below puts these together):
    • scrape the items from the current list page before requesting the next one, which is why the item rule comes before the "pagelinks" rule.
    • avoid crawling a page several times over, which is why I added the [contains(text(), "Next")] selector to the "pagelinks" rule. This way each "list page" gets requested exactly once.
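
Putting these notes together, a sketch of the full spider with current imports (scrapy.spiders and scrapy.linkextractors) could look like the following. The item-link XPath (//td[@class="complaint"]) is an assumption inferred from the question's extraction code, and parse_gen assumes the pages matched by the first rule expose the same markup as the question's XPaths; both would need to be verified against the real site.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from delh.items import DelhItem


class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Item rule first: scrape the complaints linked from the current
        # list page before moving on. The XPath is an assumption based on
        # the question's selectors.
        Rule(LinkExtractor(restrict_xpaths='//td[@class="complaint"]'),
             callback="parse_gen"),
        # Pagination rule: follow only the "Next" link and give it no
        # callback, so each list page is requested once and its response
        # is fed back through these rules.
        Rule(LinkExtractor(
            restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
    )

    def parse_gen(self, response):
        # Same extraction as in the question, using response.xpath
        # directly and yielding items instead of returning a list.
        for site in response.xpath('//table[@width="100%"]'):
            item = DelhItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.xpath('.//td[@class="compl-text"]/div/text()').extract()
            yield item

Note that restrict_xpaths is given element regions (the td and a elements) rather than @href attributes; the extractor pulls the href values out of those regions itself.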

