Scrape .aspx form with Python

Problem description

I'm trying to scrape https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx, which on paper seems like an easy task, with plenty of resources from other SO questions. Nonetheless, I keep getting the same error no matter how I change my request.

I tried the following:

import requests
from bs4 import BeautifulSoup

url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx"

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

    response = s.get(url)
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = {
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/11/2018 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "Excel",
        "ctl00$MainContent$btnView": "View",
        "__EVENTVALIDATION": soup.find('input', {'name': '__EVENTVALIDATION'}).get('value', ''),
        "__VIEWSTATE": soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        "__VIEWSTATEGENERATOR": soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', '')
    }

    # Post through the session (s.post, not requests.post) so the cookies
    # from the GET are reused.
    response = s.post(url, data=data)

When I print the response.content object, I get this message (tl;dr, it says "System error occurred. The system will alert technical support of the problem"):

b'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml" >\r\n<head><title>\r\n\r\n</title></head>\r\n<body>\r\n   <form name="form1" method="post" action="Error.aspx?ErrorID=86e0c980-7832-4fc5-b5a8-a8254dd8ad69" id="form1">\r\n<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTg3NjI4NzkzNmRkaCA5IA9393/t2iMAptLYU1QiPc8=" />\r\n\r\n<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9D6BDE45" />\r\n    <div>\r\n        <h4>\r\n            <span id="lblError">Error</span>\r\n        </h4>\r\n        <span id="lblMessage" class="Validator"><font color="Black">System error occurred. The system will alert technical support of the problem.</font></span>\r\n    </div>\r\n    </form>\r\n</body>\r\n</html>\r\n'

I have tried other options, like changing the __EVENTTARGET argument, as suggested here, and also passing the cookie from the first request to the POST request. Checking the source of the page, I noticed that the form has a "query" function that needs __EVENTTARGET and __EVENTARGUMENT to work:

//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>

But both arguments are empty in the body of the POST request (as can be checked in the Chrome developer inspector). Another problem is that I need to either download the file in one of the formats (PDF or Excel) or get the HTML version, but the .aspx form does not render the information on the same page; instead it opens a new URL, https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx, with the information.
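Since `__doPostBack(eventTarget, eventArgument)` only fills those two hidden inputs and submits the form, a requests-based client can mimic it by setting the same fields in its POST payload. A minimal sketch (the helper name is illustrative; the control IDs are taken from the page above):

```python
def as_postback(payload, event_target, event_argument=""):
    """Return a copy of `payload` with the fields __doPostBack would set."""
    data = dict(payload)
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    # A postback triggered by a control does not also send the submit button.
    data.pop("ctl00$MainContent$btnView", None)
    return data

payload = {"__VIEWSTATE": "...", "ctl00$MainContent$btnView": "View"}
postback = as_postback(payload, "ctl00$MainContent$rdoCommoditySystem$2")
```

The original `payload` is left untouched, so the button field can still be sent on the final "View" request.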

I am kind of lost here. What am I missing?

Recommended answer

I was able to solve this problem by handling the __VIEWSTATE values with more care. In an ASPX form, the page uses __VIEWSTATE to hash the state of the webpage (i.e. which options of the form the user has already selected, or in our case requested) and to allow the next request.
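Because every response carries a fresh __VIEWSTATE, a small helper that re-reads all the hidden `__`-prefixed fields from each page keeps the payload valid between requests. A sketch (the function name is illustrative; `html.parser` is used here so it runs without extra dependencies):

```python
from bs4 import BeautifulSoup

def aspnet_state(html):
    """Extract the hidden ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...)."""
    soup = BeautifulSoup(html, "html.parser")
    return {tag["name"]: tag.get("value", "")
            for tag in soup.select("input[name^=__]")}

page = '''<form>
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__VIEWSTATEGENERATOR" value="9D6BDE45" />
</form>'''
state = aspnet_state(page)
```

Calling this after every POST and merging the result into the payload is exactly the pattern the code below repeats by hand.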

In this case:

  1. Make a request to get all the hidden fields, store them in the payload, and add my first selection by updating the dictionary.
  2. Make a second request with the updated __VIEWSTATE value, and add more options to my request.
  3. Same as 2., but adding the final option.

This will give me the same HTML body I get when I make the request in the browser, but it still does not show me the data or let me download the files as part of the body of the last request. That problem can be handled with selenium, but I haven't been successful. This SO question describes my problem.

import requests
from bs4 import BeautifulSoup

url = 'https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx'

with requests.Session() as s:
    s.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }

    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')

    # Collect the pre-filled form controls and the hidden ASP.NET state fields.
    data = {tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}

    payload = data.copy()
    payload.update(state)

    # First postback: select the ELEC commodity system.
    payload.update({
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": '76',
        "ctl00$MainContent$rdoReportFormat": 'PDF',
        "ctl00$MainContent$ddlStartYear": "2008",
        "__EVENTTARGET": "ctl00$MainContent$rdoCommoditySystem$2"
    })

    response = s.post(url, data=payload, allow_redirects=True)
    soup = BeautifulSoup(response.content, 'html5lib')

    # Refresh the state fields with the __VIEWSTATE returned by the last response.
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}

    # Second postback: pick the report.
    payload.pop("ctl00$MainContent$ddlStartYear")
    payload.update(state)
    payload.update({
        "__EVENTTARGET": "ctl00$MainContent$lbReportName",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/12/2018 12:00:00 AM"
    })

    response = s.post(url, data=payload, allow_redirects=True)
    soup = BeautifulSoup(response.content, 'html5lib')

    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}

    # Final postback: set the date range and format, then press View.
    payload.update(state)
    payload.update({
        "ctl00$MainContent$ddlFrom": "01/10/1990 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "HTML",
        "ctl00$MainContent$btnView": "View"
    })

    response = s.post(url, data=payload, allow_redirects=True)
    print(response.text)
