Scraping ASPX form and avoiding Selenium


Question


I asked previously (see here) how to scrape results from an ASPX form. The form renders the output in a new tab (by using the function window.open in JS). In my previous post, I wasn't making the correct POST request, and I solved that.

The following code successfully retrieves the HTML code from the form with the correct request headers, and it matches exactly the POST response I see in the Chrome inspector. But (...) I can't retrieve the data. Once the user makes the selections, a new pop-up window opens, but I am not able to catch it. The pop-up window has a new URL, and its information is not part of the request response body.

Request URL: https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx

Pop-up URL [the data I want to download]: https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx
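For context on why the payloads below echo the hidden inputs: an ASP.NET WebForms postback is only accepted if the page's hidden state fields (`__VIEWSTATE`, `__EVENTVALIDATION`, and friends) are sent back with the form. A minimal stdlib-only sketch of harvesting them (the code below does the same with BeautifulSoup; the HTML fragment here is a made-up stand-in for the real page):

```python
from html.parser import HTMLParser

class HiddenFieldCollector(HTMLParser):
    """Collect <input type="hidden"> name/value pairs from an ASPX page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        if attr.get('type') == 'hidden' and 'name' in attr:
            self.fields[attr['name']] = attr.get('value', '')

# Made-up fragment standing in for the real Statistics.aspx markup.
page = '''
<form method="post" action="Statistics.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTI3OTMz..." />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL1..." />
  <input type="text" name="ctl00$MainContent$txtOther" value="" />
</form>
'''
collector = HiddenFieldCollector()
collector.feed(page)
print(sorted(collector.fields))  # ['__EVENTVALIDATION', '__VIEWSTATE']
```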

import requests
from bs4 import BeautifulSoup

url = 'https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx'

with requests.Session() as s:
        s.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36",
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9"
        }

        response = s.get(url)
        soup = BeautifulSoup(response.content, 'html5lib')

        # Collect the pre-filled visible fields plus the ASP.NET state
        # fields (__VIEWSTATE, __EVENTVALIDATION, ...) that every
        # postback must echo back.
        data = { tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')
            }
        state = { tag['name']: tag['value']
                for tag in soup.select('input[name^=__]')
            }

        payload = data.copy()
        payload.update(state)

        # First postback: select the ELEC commodity system (simulates the
        # radio button's __doPostBack).
        payload.update({
            "ctl00$MainContent$rdoCommoditySystem": "ELEC",
            "ctl00$MainContent$lbReportName": '76',
            "ctl00$MainContent$rdoReportFormat": 'PDF',
            "ctl00$MainContent$ddlStartYear": "2008",
            "__EVENTTARGET": "ctl00$MainContent$rdoCommoditySystem$2"
        })

        print(payload['__EVENTTARGET'])
        print(payload['__VIEWSTATE'][-20:])

        response = s.post(url, data=payload, allow_redirects=True)
        soup = BeautifulSoup(response.content, 'html5lib')

        state = { tag['name']: tag['value'] 
                 for tag in soup.select('input[name^=__]')
             }

        payload.pop("ctl00$MainContent$ddlStartYear")
        payload.update(state)
        # Second postback: choose report 171 and set the "From" date.
        payload.update({
            "__EVENTTARGET": "ctl00$MainContent$lbReportName",
            "ctl00$MainContent$lbReportName": "171",
            "ctl00$MainContent$ddlFrom": "01/12/2018 12:00:00 AM"
        })

        print(payload['__EVENTTARGET'])
        print(payload['__VIEWSTATE'][-20:])

        response = s.post(url, data=payload, allow_redirects=True)
        soup = BeautifulSoup(response.content, 'html5lib')

        state = { tag['name']: tag['value']
                 for tag in soup.select('input[name^=__]')
                }

        payload.update(state)
        # Final postback: request the report as HTML and press "View";
        # this is the step that triggers window.open("ViewReport.aspx").
        payload.update({
            "ctl00$MainContent$ddlFrom": "01/10/1990 12:00:00 AM",
            "ctl00$MainContent$rdoReportFormat": "HTML",
            "ctl00$MainContent$btnView": "View"
        })

        print(payload['__VIEWSTATE'])

        response = s.post(url, data=payload, allow_redirects=True)
        print(response.text)

Is there any way to retrieve the data from the pop-up window using requests and bs4? I noticed that requests-html can parse and render JS, but all my attempts have been unsuccessful.

The page source shows this JS code, which I guess is the one that opens the pop-up window with the data:


//<![CDATA[
window.open("ViewReport.aspx", "_blank");Sys.Application.initialize();
//]]>

But I'm unable to access it.
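Since the pop-up does nothing more than `window.open("ViewReport.aspx", "_blank")`, the generated report is presumably stored server-side against the session cookie. One approach worth trying (a sketch, not verified against this site) is a plain GET on that URL with the same `requests.Session` immediately after the final POST:

```python
import requests

STATISTICS_URL = 'https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx'
REPORT_URL = 'https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx'

def fetch_report(session: requests.Session) -> str:
    """GET the pop-up URL the final postback would have opened.

    Assumption (unverified): the report lives in server-side session
    state keyed by the ASP.NET session cookie, so the same Session
    object that performed the POSTs can simply GET it.
    """
    resp = session.get(REPORT_URL, headers={'Referer': STATISTICS_URL})
    resp.raise_for_status()
    return resp.text
```

In the code above this would be `html = fetch_report(s)` right after the last `s.post(...)`.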

Solution

See this Scrapy blog post: https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition

I have used this concept in the past to scrape ASPX pages.
