Scrapy can't find form on page


Problem description

I'm trying to write a spider that will automatically log in to this website. However, when I try using scrapy.FormRequest.from_response in the shell I get the error:

No <form> element found in <200 https://www.athletic.net/account/login/?ReturnUrl=%2Fdefault.aspx>

I can definitely see the form when I use Inspect Element on the site, but it does not show up in Scrapy when I try to find it with response.xpath() either. Is it possible that the form content is somehow hidden from my spider? If so, how do I fix it?

Solution

The form is created with JavaScript; it is not part of the static HTML source code. Scrapy does not execute JavaScript, so the form cannot be found in the response.

The relevant part of the static HTML (where they inject the form using JavaScript) is:

<div ng-controller="AppCtrl as appC" class="m-auto pt-3 pb-5 container" style="max-width: 425px;">
    <section ui-view></section>
</div>

To find issues like this, I would either:

  • compare the source code from "View Source Code" and "Inspect" to each other
  • browse the web page with a browser that has JavaScript disabled (when I develop scrapers, I usually keep one browser with JavaScript enabled for research and documentation, and another one for checking web pages without JavaScript)
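As a quick sanity check along those lines, you can also search the raw response body for the tag you expect. A minimal sketch, using the static HTML snippet above (the form itself is injected client-side, so it never appears in the static source):

```python
# The static HTML Scrapy actually receives (simplified); Angular injects
# the login form into <section ui-view> at runtime in the browser.
static_html = """
<div ng-controller="AppCtrl as appC" class="m-auto pt-3 pb-5 container" style="max-width: 425px;">
    <section ui-view></section>
</div>
"""

# No <form> tag exists in the static source, which is why
# FormRequest.from_response() reports "No <form> element found".
has_form = "<form" in static_html
print(has_form)  # False
```

In a real spider, the equivalent check would be `"<form" in response.text` inside your callback.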

In this case, you have to manually create your FormRequest for this web page. I was not able to spot any form of CSRF protection on their form, so it might be as simple as:

from scrapy import FormRequest

FormRequest(url='https://www.athletic.net/account/auth.ashx',
            formdata={"e": "foo@example.com", "pw": "secret"})

However, I suspect you cannot use formdata here: formdata sends URL-encoded form fields, while this site appears to expect a JSON body. I am not sure FormRequest can handle that, so you probably want to use a standard Request and encode the JSON yourself.

Since they make heavy use of JavaScript on their front end, you cannot use the page's source code to find these parameters either. Instead, I used my browser's developer console and inspected the request/response that occurred when I tried to log in with invalid credentials.

This gave me:

General:
Request URL: https://www.athletic.net/account/auth.ashx
[...]

Request Payload:
{e: "foo@example.com", pw: "secret"}
