Scraping through pages of aspx website - only gets page 1


Problem description

For the last month or so, I've been trying to read a few pages from an aspx site. I have no problem finding all the required items on the site, but my attempted solution is still not working properly. I read somewhere that all the header details must be present, so I added them. I also read somewhere that __EVENTTARGET must be set to tell aspx which button had been pressed, so I tried a few different things (see below). I also read that a session should be established to deal with the cookies, so I implemented that as well. As of now, my code snippet produces exactly the same info that I get when I use a web developer tool to analyse the POST request (the print lines have been commented out), but this code always gives me the first page. Does anyone know what is missing in this code for it to work? I should also point out that Selenium or Mechanize is not really an option for this project.

import requests
from bs4 import BeautifulSoup
import time
import collections
import json

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
          'Accept-Encoding' : 'gzip, deflate',
          'Accept-language' : 'en-US,en;q=0.9',
          'Cache-Control' : 'max-age=0',
          'Connection' : 'keep-alive',
          'Content-Type': 'text/html; charset=utf-8',
          'Host' : 'www.bolsamadrid.es',
          'Origin' : 'null',
          'Upgrade-Insecure-Requests' : '1',
          'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
          }
    for i in range(0, numpage):
        ses = requests.session()
        if(i == 0):
            req = ses.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header)
        else:
            req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, data = payload)
        # print(req.request.body)
        # print(req.request.headers)
        # print(req.request.url)
        page = req.text
        soup = BeautifulSoup(page, "lxml")
        # find __VIEWSTATE and __EVENTVALIDATION for the next page
        viewstate = soup.select("#__VIEWSTATE")[0]['value']
        # print("VIEWSTATE: ", viewstate)
        eventval = soup.select("#__EVENTVALIDATION")[0]['value']
        # print("EVENTVALIDATION: ", eventval)
        header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                  'Accept-Encoding' : 'gzip, deflate',
                  'Accept-language' : 'en-US,en;q=0.9',
                  'Cache-Control' : 'max-age=0',
                  'Connection' : 'keep-alive',
                  'Content-Type': 'application/x-www-form-urlencoded',
                  'Host' : 'www.bolsamadrid.es',
                  'Origin' : 'null',
                  'Upgrade-Insecure-Requests' : '1',
                  'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
                  }
        target = "ct100$Contenido$GoPag{:=>2}"
        payload = collections.OrderedDict()
        payload['__EVENTTARGET'] = ""
        # payload['__EVENTTARGET'] = "GoPag"
        # payload['__EVENTTARGET'] = "ct100$Contenido$GoPag"
        # payload['__EVENTTARGET'] = target.format(i + 1)
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",", "").replace("S.A.", ""))
        time.sleep(1)


SPAIN_STK_LIST(6)
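For reference, the {:=>2} spec in the target string is a standard Python format spec ("=" as the fill character, right-aligned, width 2), so the commented-out target.format(i + 1) attempt would have produced values like these:

target = "ct100$Contenido$GoPag{:=>2}"
print(target.format(2))   # ct100$Contenido$GoPag=2   ("2" padded with "=" to width 2)
print(target.format(10))  # ct100$Contenido$GoPag10   (already 2 chars wide, no padding)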

Note that the first header's Content-Type is set to "text/html" since this is the initial request, but any subsequent request is sent with a Content-Type of "application/x-www-form-urlencoded". Any pointers as to what I should try next would be much appreciated. E.
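Incidentally, when a dict is passed via data=, requests urlencodes the body and sets that Content-Type header by itself, so it does not strictly need to be hardcoded. A minimal sketch (using the public httpbin.org echo service as an example endpoint):

import requests

# requests encodes a dict body as application/x-www-form-urlencoded
# and sets the matching Content-Type header automatically
r = requests.post("http://httpbin.org/post", data={"page": 2})
print(r.request.headers["Content-Type"])  # application/x-www-form-urlencoded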

Recommended answer

The easiest way would be something like the following. Why hardcode __EVENTTARGET, __VIEWSTATE and so on? Let the script take care of those:

import requests
from bs4 import BeautifulSoup

url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"

res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

for page in range(7):
    # rebuild the form from every input field on the page,
    # overriding only the page-number field
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        if "ctl00$Contenido$GoPag" in item.get("name"):
            formdata[item.get("name")] = page
        else:
            formdata[item.get("name")] = item.get("value")

    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    # skip the header row, then print each company row
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)

Assuming that you need the tabular data spread across multiple pages.
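If the rows need to be kept rather than just printed, a small variant of the same approach that collects them and writes a CSV (the empresas.csv filename is only an example):

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"

res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

rows = []
for page in range(7):
    # rebuild the form from the page's input fields, overriding only the page number
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        name = item.get("name")
        if not name:
            continue  # skip nameless inputs defensively
        formdata[name] = page if "ctl00$Contenido$GoPag" in name else item.get("value")
    soup = BeautifulSoup(requests.post(url, data=formdata).text, "lxml")
    for tr in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        rows.append([td.get_text(strip=True) for td in tr.select("td")])

with open("empresas.csv", "w", newline="") as f:  # output filename is just an example
    csv.writer(f).writerows(rows)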
