Crawling with an authenticated session in Scrapy


Question


In my previous question, I wasn't very specific about my problem (scraping with an authenticated session in Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably have used the word crawling instead.

So, here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if "Hi Herman" not in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i = MyItem()  # assumes an Item subclass defined elsewhere in the project
        i['url'] = response.url

        # ... do more things

        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I overrode in order to log in now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider) Any help would be appreciated.

Solution

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
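As a quick sketch of what that means in practice (the spider name and URLs here are hypothetical): leave parse alone and give the rule its own callback, and the built-in parse will route matched links to it.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PlainSpider(CrawlSpider):
    name = 'plainspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # The inherited parse hands each response to the rules; matching
        # links are extracted, followed, and sent to parse_item.
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("Rule callback got %s" % response.url)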


Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from CrawlSpider) and override the init_request function. This function is called when the spider is initialising, before it starts crawling.

In order for the Spider to begin crawling, you need to call self.initialized(); the example below returns its result from the login-check callback.

You can read the code that's responsible for this in scrapy.contrib.spiders.init (it has helpful docstrings).


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page (a pass keeps this a valid method body)
        pass


Saving items:

Items your Spider returns are passed along to the Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
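As a small, hedged illustration (the item, pipeline, and module names are hypothetical; the list-style ITEM_PIPELINES setting matches the 0.14-era docs linked above):

from scrapy.item import Item, Field

class PageItem(Item):
    # A minimal item with the single field used in the question's code.
    url = Field()

class LogUrlPipeline(object):
    """Log each item's URL, then hand the item to the next pipeline stage."""
    def process_item(self, item, spider):
        spider.log("Pipeline got an item for %s" % item['url'])
        return item

You would enable the pipeline in your project's settings.py, e.g. ITEM_PIPELINES = ['myproject.pipelines.LogUrlPipeline'].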

If you have any problems/questions in regards to Items, don't hesitate to pop open a new question and I'll do my best to help.
