Using Scrapy with authenticated (logged in) user session


Question

In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy import log

class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check that login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...

I've got that working, and it's fine. But my question is: What do you have to do to continue scraping with authenticated session, as they say in the last line's comment?

Answer

In the code above, the FormRequest that is used to authenticate has the after_login function set as its callback. This means that after_login will be called with the page returned by the login attempt as its response.

It then checks whether you are successfully logged in by searching the page for a specific string, in this case "authentication failed". If that string is found, the spider stops.

Once the spider has got this far, it knows it has successfully authenticated, and you can start spawning new requests and/or scraping data. So, in this case:

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log

# ...

def after_login(self, response):
    # check that login succeeded before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')

    # etc.



If you look here, there's an example of a spider that authenticates before scraping.

In this case, it handles things in the parse function (the default callback of any request).

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)

So, whenever a request is made, the response is checked for the presence of the login form. If the form is there, we know we need to log in, so we call the relevant function; if it is not, we call the function responsible for scraping the data from the response.
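
For completeness, here is a minimal sketch of what that login helper might look like, using the same old-style Scrapy API as the snippets above; the credentials and form field names are assumptions for illustration, not taken from the linked spider.

from scrapy.http import FormRequest

def login(self, response):
    # Sketch only: the credentials and form field names are assumed and
    # must match the site's actual login form.
    return [FormRequest.from_response(response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.parse)]

Pointing the callback back at parse means the same form check runs on the response: if the login form is gone, the spider carries on to get_section_links; if the form is still present, the login did not succeed.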

I hope this is clear, feel free to ask if you have any other questions!

Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.

To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:

def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests

    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href').extract()  # extract() gives the href strings

    # Yield a new request for each link we found
    # (relative hrefs may need to be joined with response.url first)
    for link in links:
        yield Request(url=link, callback=self.parse_page)

As you can see, it spawns a new request for every URL found on the page, and each of those requests will call this same function with its response, so we have some recursive scraping going on.

What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
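
For reference, here is a minimal sketch of the CrawlSpider approach, written against the same old-style API as the rest of this answer; the spider name, URLs and callback name are illustrative assumptions. (In newer Scrapy versions the equivalent imports live in scrapy.spiders and scrapy.linkextractors.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class TastyCrawlSpider(CrawlSpider):
    name = 'example.com.crawl'        # illustrative name
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/tastypage/']

    # Follow every link within allowed_domains and hand each page to parse_page.
    # CrawlSpider uses parse() internally, so the callback must have another name.
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img')
        # ... do something with them

Note that combining CrawlSpider with the login flow above takes a little care: since CrawlSpider reserves parse() for its own link-following logic, the login request has to be issued and verified before the rules take over.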

