Submit form that renders dynamically with Scrapy?


Question

I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML on the page that corresponds to a successful login.

I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element in the DOM, but I was wondering if it would be possible to "give control back" to Scrapy after getting the full HTML, so that it can carry out the form submission and save the necessary cookies, session data, etc., in order to scrape the page.

Basically, the only reason I thought Selenium was necessary was that I needed the page to render from the JavaScript before Scrapy looks for the <form> element. Are there any alternatives to this, however?

Thanks!

This question is similar to this one, but unfortunately the accepted answer deals with the Requests library instead of Selenium or Scrapy. Though that scenario may be possible in some cases (watch this to learn more), as alecxe points out, Selenium may be required if "parts of the page [such as forms] are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser".
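For reference, that Requests-based flow looks roughly like this (a minimal sketch; the login URL and form field names below are hypothetical placeholders, not a real site's):

import requests

# Minimal sketch of a Requests-based form login; the URL and field
# names are hypothetical placeholders.
session = requests.Session()
session.post('https://example.com/login',
             data={'email': 'user@example.com', 'password': 'secret'})
# The session object now carries the auth cookies for later requests.
response = session.get('https://example.com/account')
print(response.status_code)

This only works when the form fields and the POST target can be determined without executing JavaScript, which is exactly what is in question here.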

Answer

Scrapy is not actually a great fit for the Coursera site, since it is extremely asynchronous. Parts of the page are loaded via API calls and inserted into the page with the help of JavaScript code being executed in the browser. Scrapy is not a browser and cannot handle that.

Which raises the point - why not use the publicly available Coursera API?
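For context, the documented part of that API could be queried without any authentication; here is a minimal sketch, assuming the courses.v1 catalog endpoint with start/limit paging (the endpoint, parameters, and response shape are assumptions, not taken from the answer):

import requests

# Hypothetical sketch of querying the public catalog endpoint; the
# parameters and the response shape are assumptions.
response = requests.get('https://api.coursera.org/api/courses.v1',
                        params={'start': 0, 'limit': 10})
for course in response.json().get('elements', []):
    print(course.get('name'))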

Aside from what is documented, there are other endpoints that you can see called in browser developer tools - you need to be authenticated to be able to use them. For example, if you are logged in, you can see the list of courses you've taken:

There is a call to the memberships.v1 endpoint.

For the sake of an example, let's start Selenium, log in, and grab the cookies with get_cookies(). Then, let's yield a Request to the memberships.v1 endpoint to get the list of archived courses, providing the cookies we got from Selenium:

import json

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


LOGIN = 'email'
PASSWORD = 'password'


class CourseraSpider(scrapy.Spider):
    name = "courseraSpider"
    allowed_domains = ["coursera.org"]

    def start_requests(self):
        # Drive a real browser so the JavaScript-rendered login form exists.
        self.driver = webdriver.Chrome()
        self.driver.maximize_window()
        self.driver.get('https://www.coursera.org/login')

        # Wait for the dynamically inserted <form> to appear in the DOM.
        form = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located(
            (By.XPATH, "//div[@data-js='login-body']//div[@data-js='facebook-button-divider']/following-sibling::form")))
        email = WebDriverWait(form, 10).until(
            EC.visibility_of_element_located((By.ID, 'user-modal-email')))
        email.send_keys(LOGIN)

        password = form.find_element(By.NAME, 'password')
        password.send_keys(PASSWORD)

        login = form.find_element(By.XPATH, '//button[. = "Log In"]')
        login.click()

        # Wait until the logged-in landing page confirms the session is active.
        WebDriverWait(self.driver, 20).until(
            EC.visibility_of_element_located((By.XPATH, "//h2[. = 'My Courses']")))

        self.driver.get('https://www.coursera.org/')
        # Hand the browser session over to Scrapy via its cookies.
        cookies = self.driver.get_cookies()

        self.driver.quit()

        courses_url = 'https://www.coursera.org/api/memberships.v1'
        params = {
            'fields': 'courseId,enrolledTimestamp,grade,id,lastAccessedTimestamp,role,v1SessionId,vc,vcMembershipId,courses.v1(display,partnerIds,photoUrl,specializations,startDate,v1Details),partners.v1(homeLink,name),v1Details.v1(sessionIds),v1Sessions.v1(active,dbEndDate,durationString,hasSigTrack,startDay,startMonth,startYear),specializations.v1(logo,name,partnerIds,shortName)&includes=courseId,vcMembershipId,courses.v1(partnerIds,specializations,v1Details),v1Details.v1(sessionIds),specializations.v1(partnerIds)',
            'q': 'me',
            'showHidden': 'false',
            'filter': 'archived'
        }

        params = '&'.join(key + '=' + value for key, value in params.items())
        yield scrapy.Request(courses_url + '?' + params, cookies=cookies)

    def parse(self, response):
        data = json.loads(response.body)

        for course in data['linked']['courses.v1']:
            print(course['name'])
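Assuming the spider above is saved as coursera_spider.py, it can be run without creating a full Scrapy project:

scrapy runspider coursera_spider.py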

For me, it prints:

Algorithms, Part I
Computing for Data Analysis
Pattern-Oriented Software Architectures for Concurrent and Networked Software
Computer Networks

Which proves that we can give Scrapy the cookies from selenium and successfully extract the data from the "for logged in users only" pages.
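Note that passing cookies=cookies on the first Request is enough for the rest of the crawl: Scrapy's built-in CookiesMiddleware stores them in its cookiejar and re-attaches them to later requests for the same domain. A minimal sketch of that continuation (the cookie value and follow-up endpoint are only illustrative):

import scrapy

class FollowUpSpider(scrapy.Spider):
    # Sketch: cookies passed on the first request persist for later
    # requests via Scrapy's built-in CookiesMiddleware.
    name = "followup"

    def start_requests(self):
        # 'cookies' would come from Selenium's get_cookies(), as above;
        # this placeholder stands in for a real session cookie.
        cookies = [{'name': 'sessionid', 'value': '...'}]
        yield scrapy.Request('https://www.coursera.org/',
                             cookies=cookies, callback=self.parse)

    def parse(self, response):
        # No cookies argument needed here; the cookiejar re-attaches them.
        yield scrapy.Request('https://www.coursera.org/api/memberships.v1?q=me',
                             callback=self.parse_api)

    def parse_api(self, response):
        self.logger.info(response.text[:200])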

Additionally, make sure you don't violate the rules from the Terms of Use, specifically:

In addition, as a condition of accessing the Sites, you agree not to ... (c) use any high-volume, automated or electronic means to access the Sites (including without limitation, robots, spiders, scripts or web-scraping tools);
