How to trigger a JS ASP.Net next page event using scrapy?


Question

I'm scraping content off this website. I start by sending a FormRequest that yields the search results, based on Wim Herman's answer to my other question here.

I scrape what is needed and want to move to the next page, which has no URL of its own; the page change is triggered by JS. Here is what the HTML tag looks like:

<a href="javascript:__doPostBack('dgSearchResults$ctl24$ctl01','')">2</a>

I tried the following, and nothing seems to work:

In [18]: fr = FormRequest.from_response(response, formdata={"__EVENTTARGET": 'dgSearchResults$ctl02$ctl03'})

In [19]: fetch(fr)                                                              
2020-08-24 16:47:06 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx> (referer: None)

In [20]: view(response)                                                         
Out[20]: True

And the following:

In [21]: fr = FormRequest.from_response(response, formdata={"__EVENTTARGET": 'dgSearchResults$ctl02$ctl01'}, clickdata={'type': 'submit'})

In [22]: fetch(fr)                                                              
2020-08-24 16:50:24 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx> (referer: None)

In [23]: view(response)                                                         
Out[23]: True

When I view the response, it either lands me on the initial page (the one containing the initial form), or nothing happens and the page number is still set to 1.

Recommended answer

As I mentioned in the comments, this is a pretty common issue on ASP.NET pages. As you probably know by now, the JS you mentioned triggers a POST request. The body of this POST request may contain the fields you filled in your search form as inputs, plus several hidden inputs generated by the page instance (like __VIEWSTATE or __VIEWSTATEGENERATOR).

When you use the FormRequest.from_response() method, it searches for those inputs to fill the request body. It does that by selecting all input elements inside the //form element in the page. Sometimes that's fine and sometimes it isn't; yours is a case where it isn't.

When the method selects all inputs, it picks up an input that was meant for something else. In your case it is this input:

<input id="cmdSearchNew" value="New Search" ... />
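
To make the behaviour described above concrete, here is a deliberately simplified, standard-library-only model of "collect every named input inside the form". It is not Scrapy's actual implementation, and the sample page is hypothetical: the name and type attributes of cmdSearchNew and the placeholder __VIEWSTATE value are assumptions, since the original tag elides its other attributes.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect the name/value pair of every named <input> inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("name"):
            # No distinction is made between hidden fields and buttons here.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def collect_form_fields(html):
    """Return a dict of all named inputs found inside form elements."""
    parser = FormInputCollector()
    parser.feed(html)
    return parser.fields


# Hypothetical page: attribute values below are assumptions for illustration.
SAMPLE_PAGE = """
<form method="post" action="frmEstateSearch2.aspx">
  <input type="hidden" name="__VIEWSTATE" value="placeholder-viewstate" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="11C1F95B" />
  <input type="submit" name="cmdSearchNew" id="cmdSearchNew" value="New Search" />
</form>
"""

fields = collect_form_fields(SAMPLE_PAGE)
print(sorted(fields))
```

In this model the "New Search" button ends up in the collected fields next to the hidden inputs, which is the kind of stray field the answer is pointing at.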


How would you know?

If you use your browser's dev tools and analyse how the request is made when changing from page 1 to page 2, you will see that it's a POST request and its body is something like this:

{
    "__EVENTTARGET":"dgSearchResults$ctl24$ctl01",
    "__EVENTARGUMENT":"",
    "__VIEWSTATE":"jyAD4Bm...",
    "__VIEWSTATEGENERATOR":"11C1F95B",
    "__EVENTVALIDATION":"TmG0xFB..."
}

However, if you inspect the body of your Scrapy request (you can print fr.body in the shell you are already using), you will see something like this:

{
    "__EVENTTARGET":"dgSearchResults$ctl24$ctl01",
    "cmdSearchNew": "New Search",
    "__VIEWSTATE":"jyAD4Bm...",
    "__VIEWSTATEGENERATOR":"11C1F95B",
    "__EVENTVALIDATION":"TmG0xFB..."
}

(The actual body will be urlencoded; this is a parsed view.)

That cmdSearchNew field shouldn't be there. It's meant for something else, but Scrapy couldn't know that because it is inside the same form. (Also, __EVENTARGUMENT won't be there because its value is empty, so Scrapy will ignore it.)
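
One way to do this comparison without reading the urlencoded body by eye is to decode it and diff the field names against what the browser sent. The sketch below uses only the standard library. The helper names parse_body and unexpected_fields are made up for illustration, and the sample body uses placeholder values rather than real __VIEWSTATE data.

```python
from urllib.parse import parse_qsl


def parse_body(body):
    """Decode an urlencoded request body (bytes or str) into a dict."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    # keep_blank_values=True so that empty fields are not silently dropped.
    return dict(parse_qsl(body, keep_blank_values=True))


def unexpected_fields(scrapy_body, browser_fields):
    """Return the field names present in the request body but not sent by the browser."""
    return sorted(set(parse_body(scrapy_body)) - set(browser_fields))


# Field names the browser sent, as seen in the dev tools.
BROWSER_FIELDS = [
    "__EVENTTARGET",
    "__EVENTARGUMENT",
    "__VIEWSTATE",
    "__VIEWSTATEGENERATOR",
    "__EVENTVALIDATION",
]

# Stand-in for fr.body; the values are placeholders, not real page data.
sample_body = (
    b"__EVENTTARGET=dgSearchResults%24ctl24%24ctl01"
    b"&cmdSearchNew=New+Search"
    b"&__VIEWSTATE=placeholder"
    b"&__VIEWSTATEGENERATOR=11C1F95B"
    b"&__EVENTVALIDATION=placeholder"
)

print(unexpected_fields(sample_body, BROWSER_FIELDS))  # ['cmdSearchNew']
```

In the Scrapy shell you would pass fr.body in place of sample_body.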

Once you have identified the problem, you can tell the from_response() method that you don't want a specific field in the body by setting it to None.

fr = FormRequest.from_response(response, formdata={
    '__EVENTTARGET': 'dgSearchResults$ctl24$ctl01',
    'cmdSearchNew': None
})

This should be enough for you to get the response for page 2.
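
If you need more than page 2, you may not want to hard-code the event target. As a rough sketch, the formdata can be derived from the link's href. The helper name postback_formdata and the drop_fields parameter are hypothetical, and the regular expression assumes the href has the same shape as the one in the question. Unlike the snippet above, this sketch also passes __EVENTARGUMENT explicitly, mirroring the body the browser sends.

```python
import re

# Matches javascript:__doPostBack('<target>','<argument>') style hrefs.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)',\s*'([^']*)'\)")


def postback_formdata(href, drop_fields=("cmdSearchNew",)):
    """Build a formdata dict for FormRequest.from_response() from a postback href.

    Returns None when the href is not a __doPostBack link.
    """
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    target, argument = match.groups()
    formdata = {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}
    # A value of None asks from_response() to leave that field out of the body.
    for name in drop_fields:
        formdata[name] = None
    return formdata


href = "javascript:__doPostBack('dgSearchResults$ctl24$ctl01','')"
formdata = postback_formdata(href)
print(formdata)

# In the Scrapy shell or a spider callback this would then be used as:
# fr = FormRequest.from_response(response, formdata=formdata)
```

Which fields need dropping depends on the page, so it is worth checking fr.body again after building the request.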
