How can I crawl a two-level website with the Scrapy framework


Problem Description

I want to crawl a website that has two levels of URLs. The first level is a multi-page list, with URLs like this:

http://www.example.com/group/p{n}/

The page layout looks like this:

  • list item link 1
  • list item link 2
  • list item link 3
  • list item link 4

1,2,3,4,5 ... next page

The second level is a detail page, with URLs like this:

http://www.example.com/group/view/{n}/

My spider code is:

import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls=[
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
                               callback='parse_list_page',
                               follow=True)
    ]

    def parse_list_page(self, response):

        list_page=response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()

        for item in list_page:
            yield scrapy.http.Request(self,url=urljoin(response.url,item),callback=self.parse_detail_page)


    def parse_detail_page(self,response):

        community_name=response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()

        self.log(community_name,2)  

My question is: my parse_detail_page never seems to run. Can anyone tell me why, and how can I fix it?

Thanks!

Recommended Answer

You should never overwrite the parse method of CrawlSpider, because it contains the core parsing logic for this type of spider, so your def parse( should be def parse_list_page( - that typo is your issue.
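
As a minimal illustration of that point (a hypothetical spider and URL pattern, not the asker's real site), any callback referenced from rules has to use a name other than parse, because CrawlSpider defines parse itself and uses it to route responses through the rules:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class ExampleCrawler(CrawlSpider):
    name = "example"
    start_urls = ["http://www.example.com/group/"]

    rules = [
        # CrawlSpider's own parse() dispatches matched links to this callback,
        # so the callback must have any name other than "parse".
        Rule(LxmlLinkExtractor(allow=r"/group/view/\d+/"),
             callback="parse_detail_page"),
    ]

    def parse_detail_page(self, response):
        yield {"url": response.url}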

However, your rule looks like overhead, because it uses both a callback and follow=True just to extract links. It is better to use a list of rules and rewrite your spider like this:

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"),
             follow=True),
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='li-itemmod']/div/h3/a"),
             callback='parse_detail_page'),
    ]

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)
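
As a usage note, here is a sketch (reusing the question's XPath, which is an assumption about the real page) that yields the scraped value as an item instead of only logging it, so a run such as scrapy crawl AnjukeCrawler -o communities.json can store the results:

    def parse_detail_page(self, response):
        # Same XPath as the question (an assumption about the real page);
        # yielding a dict lets Scrapy's feed export collect the results.
        community_name = response.xpath(
            "//dl[@class='comm-l-detail float-l']/dd").extract_first()
        yield {"community_name": community_name}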

BTW, there are too many brackets in the link extractor: restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])
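
For reference, restrict_xpaths accepts either a single XPath string or a list of strings, so either of these forms works without the extra wrapping:

LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']")
LxmlLinkExtractor(restrict_xpaths=["//div[@class='multi-page']/a[@class='aNxt']"])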
