Can't scrape names from next pages using requests


Question

I'm trying to parse names across multiple pages of a webpage using a Python script. With my current attempt I can get the names from its landing page. However, I can't figure out how to fetch the names from the next pages as well using requests and BeautifulSoup.

Website link

My attempt so far:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

I've tried to modify my script to get only the content from the second page, to make sure it works when a next-page button is involved, but unfortunately it still fetches data from the first page:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url,data=payload,headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
        })
    sauce = BeautifulSoup(res.text,"lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

Answer

Navigating to the next page is performed via a POST request carrying the __VIEWSTATE cursor.

How to handle the requests:

  1. Make a GET request to the first page;
  2. Parse the required data and the __VIEWSTATE cursor;
  3. Prepare a POST request for the next page with the received cursor;
  4. Run it, then parse all the data and the new cursor for the next page.

I won't provide any code, because it would require writing almost the entire crawler.

==== Added ====

You almost had it, but there are two important things you missed.

  1. It is necessary to send headers with the first GET request. Without them the server returns broken tokens (easy to spot visually: they don't have the == at the end).

  2. We need to add __ASYNCPOST to the payload we send. (Interestingly, it is not the boolean True but the string 'true'.)
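The first point can be eyeballed with a tiny helper (a hypothetical check; the token values below are made up for illustration):

```python
def token_looks_complete(token: str) -> bool:
    """Heuristic from the note above: tokens scraped without the proper
    headers come back truncated and lack the trailing '==' padding."""
    return token.endswith("==")

# Made-up token values, for illustration only:
print(token_looks_complete("ZmFrZXRva2Vu=="))  # True  -> likely intact
print(token_looks_complete("ZmFrZXRva2V"))     # False -> likely truncated
```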

Here's the code. I removed bs4 and switched to lxml (I don't like bs4; it is very slow). We know exactly which data we need to send, so let's parse only a few inputs.

import re
import requests
from lxml import etree


def get_nextpage_tokens(response_body):
    """ Parse tokens from XMLHttpRequest response for making next request to next page and create payload """
    try:
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
        payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
        payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
        payload['__ASYNCPOST'] = 'true'
        return payload
    except AttributeError:
        # re.search() returned None: the response contains no tokens
        return None


if __name__ == '__main__':
    url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
            }

    with requests.Session() as s:
        page_num = 1
        r = s.get(url, headers=headers)
        parser = etree.HTMLParser()
        tree = etree.fromstring(r.text, parser)

        # Creating payload
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
        payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
        payload['__ASYNCPOST'] = 'true'
        headers['X-Requested-With'] = 'XMLHttpRequest'

        while True:
            page_num += 1
            res = s.post(url, data=payload, headers=headers)

            print(f'page {page_num} data: {res.text}')  # FIXME: Parse data

            payload = get_nextpage_tokens(res.text)  # Creating payload for next page
            if not payload:
                # Break if we got no tokens - maybe it was last page (it must be checked)
                break

Important

The response is not well-formed HTML, so you have to deal with that: cut out the table, or something similar. Good luck!
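For example, one way to deal with it is to skip HTML parsing entirely and pull the names out of the raw async response with a regex keyed on the span ids from the question's selector (a sketch, assuming the fragment keeps the `_lblName` ids; the sample response below is fabricated):

```python
import re

def extract_names(async_response: str) -> list:
    """Pull contractor names out of the pipe-delimited ASP.NET AJAX
    response without requiring well-formed HTML."""
    return re.findall(
        r'<span[^>]*id="[^"]*_lblName"[^>]*>([^<]*)</span>',
        async_response)

# Fabricated fragment mimicking the async response format:
sample = ('1|updatePanel|UpdatePanel1|<table id="gvContractors"><tr><td>'
          '<span id="gvContractors_ctl02_lblName">ACME ELECTRICAL</span>'
          '</td></tr></table>|')
print(extract_names(sample))  # ['ACME ELECTRICAL']
```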
