Scrapy Follow & Scrape Next Pages


Problem Description

I am having a problem where none of my Scrapy spiders will crawl a website; each one scrapes a single page and then stops. I was under the impression that the rules member variable was responsible for following links, but I can't get it to follow any. I have been following the documentation here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

What could I be missing that keeps my bots from crawling?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

from Example.items import ExItem

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk',
    )

    rules = ( Rule (LinkExtractor(allow=("", ),),
                    callback="parse_items",  follow= True),
    )

Recommended Answer

Replace your rules with this:

rules = (
    Rule(
        LinkExtractor(allow=('course-finder',),
                      restrict_xpaths=('//div[@class="pagination"]',)),
        callback='parse_items',
        follow=True,
    ),
)
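
For context, here is a minimal sketch of how that rule could fit into the spider from the question. It reuses the old scrapy.contrib imports and the ExItem/parse_items names from the question; the body of parse_items is a placeholder, since the real fields depend on how ExItem is defined.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from Example.items import ExItem

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk',
    )

    # Follow only links whose URL contains 'course-finder' and that sit inside
    # the pagination <div>; every matched page is passed to parse_items.
    rules = (
        Rule(
            LinkExtractor(allow=('course-finder',),
                          restrict_xpaths=('//div[@class="pagination"]',)),
            callback='parse_items',
            follow=True,
        ),
    )

    def parse_items(self, response):
        # Placeholder extraction logic; fill in the XPaths for the fields
        # actually defined on ExItem.
        item = ExItem()
        # item['name'] = response.xpath('//h1/text()').extract()
        yield item

Two details worth keeping in mind with CrawlSpider: the callback must not be named parse, because CrawlSpider uses parse internally to drive the rules, and restrict_xpaths confines link extraction to the pagination block, which is what lets the spider move from one results page to the next.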
