Crawling a site recursively using scrapy


Question

I am trying to scrape a site using scrapy.

This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all, so I tried to rebuild it):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from nettuts.items import NettutsItem
from scrapy.http import Request


class MySpider(Spider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    rules = [Rule(LinkExtractor(allow=('')), callback='parse', follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = []

        titles = hxs.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item
        return

The problem is that the crawler goes to the start page but does not scrape any pages after that.

Answer

The following can be a good way to start:

There can be two use cases for 'crawling a site recursively using scrapy'.

A) We only want to move across the website using the pagination buttons and fetch the data. This is relatively straightforward.

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        ''' do something with this parser '''
        # Grab the link held by the 'next page' pagination button
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            # Turn the relative href into an absolute URL and crawl it recursively
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Observe the last four lines. Here:

  1. We get the next-page link from the xpath of the 'next page' pagination button.
  2. The if condition checks whether we have reached the end of the pagination.
  3. We join this link (obtained in step 1) with the main URL using urljoin (see the short sketch after this list).
  4. We make a recursive call to the 'parse' callback method.
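For illustration only, here is a minimal sketch of what response.urljoin does with a relative pagination link; the URLs and the TextResponse stub are made-up values, not part of the original answer:

from scrapy.http import TextResponse

# A stub response for a made-up listing page (example values only).
response = TextResponse(url="http://example.com/trains?page=1", body=b"", encoding="utf-8")

# A relative href such as "/trains?page=2" is resolved against the response URL,
# producing the absolute URL that is then passed to scrapy.Request.
print(response.urljoin("/trains?page=2"))  # http://example.com/trains?page=2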

B) We not only want to move across pages, but also want to extract data from one or more links on those pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']

    rules = (
        # Follow the 'next page' pagination links, with no callback
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # Send every link matching /trains/<number> to parse_trains
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Here, note that:

  1. We are using the 'CrawlSpider' subclass of the 'scrapy.Spider' parent class.

  2. We have set the 'rules':

a) The first rule just checks whether there is a 'next_page' available and follows it.

b) The second rule requests all the links on a page that are in a given format, say '/trains/12343', and then calls 'parse_trains' to perform the parsing operation (a small check of the allow pattern follows below).
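As a side illustration (not from the original answer), the allow pattern is an ordinary regular expression matched against candidate URLs, so a quick check with made-up URLs might look like this:

import re

# Same pattern as in the second Rule above; the URLs are made-up examples.
pattern = re.compile(r"/trains/\d+$")
print(bool(pattern.search("http://example.com/trains/12343")))          # True: detail page
print(bool(pattern.search("http://example.com/trains/12343/reviews")))  # False: does not end with the id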

Important: Note that we don't want to use the regular 'parse' method here, as we are using the 'CrawlSpider' subclass. This class also has a 'parse' method of its own, so we don't want to override it. Just remember to name your callback method something other than 'parse'.
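For completeness, here is a hedged sketch (not part of the original answer) of how approach B could be applied to the question's nettuts spider; the class name, callback name, allow pattern, and the widened allowed_domains are assumptions, while the item and XPath come from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem


class NettutsCrawlSpider(CrawlSpider):
    name = "nettuts_crawl"
    # Widened from the question's "net.tutsplus.com" so that requests to
    # code.tutsplus.com are not filtered as offsite (assumption).
    allowed_domains = ["tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts"]

    rules = (
        # Follow pagination links and hand every listing page to the callback.
        # The allow pattern is an assumption about the site's URL scheme.
        Rule(LinkExtractor(allow=r"/posts\?page=\d+"),
             callback="parse_listing", follow=True),
    )

    def parse_listing(self, response):
        # Same selector and item as in the question's code; the callback is
        # deliberately NOT named 'parse'.
        for title in response.xpath('//li[@class="posts__post"]/a/text()').extract():
            item = NettutsItem()
            item["title"] = title
            yield item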
