Scrapy with selenium for a webpage requiring authentication


Question

I am trying to scrape data from a page that makes a lot of AJAX calls and executes JavaScript to render the webpage, so I am trying to use scrapy together with selenium to do this. The modus operandi is as follows:

  1. Add the login page URL to the scrapy start_urls list.

  2. Use the FormRequest.from_response method to post the username and password and get authenticated.

The code that I have thus far is as follows:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time


    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # Post the credentials using the form found on the login page
            return FormRequest.from_response(
                response,
                formdata={'User': 'username', 'Pass': 'password'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            # Hand the authenticated URL over to selenium for rendering
            self.driver.get(response.url)
            next_button = self.driver.find_element_by_class_name('dxWeb_pNext')
            next_button.click()
            time.sleep(2)
            # capture the html and store in a file

The 2 roadblocks I have hit so far are:

  1. Step 4 does not work. Whenever selenium opens the Firefox window, it is always at the login screen and does not know how to get past it.

  2. I don't know how to achieve step 5.

Any help would be greatly appreciated.

Answer

I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site using selenium, not yield Request(). The login session you created with scrapy is not transferred to the selenium session. Here is an example (the element ids/xpath will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    # Fill in the login form and submit it from within selenium
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

Then you can do something like this:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

And so on.
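If the fixed time.sleep() pauses turn out to be flaky, an explicit wait is a more robust way to let each page finish rendering. A minimal sketch, assuming the same dxWeb_pNext pager element; the clickable condition is just one plausible readiness check:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds for the pager link to become clickable,
    # rather than sleeping for a fixed interval
    next_button = WebDriverWait(self.driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'dxWeb_pNext')))
    next_button.click()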

If you need to render javascript and are worried about speed/non-blocking, you can use http://splash.readthedocs.org/en/latest/index.html which should do the trick.
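For illustration, a minimal sketch using the scrapy-splash plugin (my assumption of how you would wire it up; it presumes a Splash instance running at localhost:8050 with the plugin's middlewares enabled in settings.py, and the URL and wait time are placeholders):

    from scrapy_splash import SplashRequest

    # Assumes settings.py contains SPLASH_URL = 'http://localhost:8050'
    # plus the downloader middlewares from the scrapy-splash docs
    def start_requests(self):
        # Splash renders the JavaScript server-side; 'wait' gives the
        # AJAX calls time to finish before the HTML comes back
        yield SplashRequest(
            "http://www.example.com/authen_handler.aspx",
            callback=self.parse_page,
            args={'wait': 2})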

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
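For reference, a sketch of the kind of Lua script that page describes, sent through scrapy-splash's execute endpoint (the cookie name, value, and domain are placeholders for whatever scrapy's login response actually sets):

    # Lua script for Splash; the cookie values are placeholders that
    # would come from the cookies scrapy received at login
    lua_script = """
    function main(splash)
        splash:add_cookie{"sessionid", "your_session_id", "/",
                          domain="www.example.com"}
        splash:go("http://www.example.com/authen_handler.aspx")
        splash:wait(2)
        return splash:html()
    end
    """

    yield SplashRequest(
        "http://www.example.com/authen_handler.aspx",
        callback=self.parse_page,
        endpoint='execute',
        args={'lua_source': lua_script})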

