How to recursively scrape every link from a site using Scrapy?

Problem Description

I'm trying to obtain every single link (and no other data) from a website using Scrapy. I want to do this by starting at the homepage, scraping all the links from there, then for each link found, following it and scraping all (unique) links from that page, and repeating this for every link found until there are no more left to follow.

I also have to enter a username and password to get into each page on the site, so I've included a basic authentication component in my start_requests.

So far I have a spider which gives me the links on the homepage only; however, I can't seem to figure out why it's not following the links and scraping other pages.

Here is my spider:

    from examplesite.items import ExamplesiteItem
    import scrapy
    from scrapy.linkextractor import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy import Request
    from w3lib.http import basic_auth_header
    from scrapy.crawler import CrawlerProcess

    class ExampleSpider(CrawlSpider):
        #name of crawler
        name = "examplesite"

        #only scrape on pages within the example.co.uk domain
        allowed_domains = ["example.co.uk"]

        #start scraping on the site homepage once credentials have been authenticated
        def start_requests(self):
            url = str("https://example.co.uk")
            username = "*********"
            password = "*********"
            auth = basic_auth_header(username, password)
            yield scrapy.Request(url=url, headers={'Authorization': auth})

        #rules for recursively scraping the URLs found
        rules = [
            Rule(
                LinkExtractor(
                    canonicalize=True,
                    unique=True
                ),
                follow=True,
                callback="parse"
            )
        ]

        #method to identify hyperlinks by xpath and extract hyperlinks as scrapy items
        def parse(self, response):
            for element in response.xpath('//a'):
                item = ExamplesiteItem()
                oglink = element.xpath('@href').extract()
                #need to add on prefix as some hrefs are not full https URLs and thus cannot be followed for scraping
                if "http" not in str(oglink):
                    item['link'] = "https://example.co.uk" + oglink[0]
                else:
                    item['link'] = oglink

                yield item

Here is my Items class:

    from scrapy import Field, Item

    class ExamplesiteItem(Item):
        link = Field()

I think the bit I'm going wrong is the "Rules", which I'm aware you need in order to follow the links, but I don't fully understand how they work (I've tried reading several explanations online but I'm still not sure).

Any help would be much appreciated!

Recommended Answer

Your rules are fine; the problem is overriding the parse method.

From the Scrapy docs at https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
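A minimal sketch of the fix, keeping the spider from the question otherwise unchanged: rename the callback to something other than parse (parse_link below is an illustrative name, not required by Scrapy) and point the Rule at it, so CrawlSpider's own parse method is free to follow the extracted links.

    from examplesite.items import ExamplesiteItem
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from w3lib.http import basic_auth_header

    class ExampleSpider(CrawlSpider):
        name = "examplesite"
        allowed_domains = ["example.co.uk"]

        #same rule as before, but the callback no longer clashes with CrawlSpider.parse
        rules = [
            Rule(
                LinkExtractor(canonicalize=True, unique=True),
                follow=True,
                callback="parse_link"   #renamed callback
            )
        ]

        def start_requests(self):
            url = "https://example.co.uk"
            auth = basic_auth_header("*********", "*********")
            yield scrapy.Request(url=url, headers={'Authorization': auth})

        #renamed from parse so CrawlSpider can use its own parse to apply the rules
        def parse_link(self, response):
            for element in response.xpath('//a'):
                item = ExamplesiteItem()
                oglink = element.xpath('@href').extract_first()
                if oglink:
                    #response.urljoin turns relative hrefs into absolute URLs
                    item['link'] = response.urljoin(oglink)
                    yield item

Note that, as in the original spider, only the initial request carries the Authorization header here; if every page on the site requires it, one option is Scrapy's built-in HttpAuthMiddleware (the http_user and http_pass spider attributes), which attaches basic-auth credentials to the spider's requests.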
