Can't parse links after a certain page using requests


Problem Description

I've created a script in Python to parse the links of different items while traversing multiple pages. To parse the links from the landing page, a GET request is enough, so I used a GET request for the first page.

However, the site requires a POST request with appropriate parameters to get the links from the next pages. I did that as well. The script can now parse the links up to page 11. Trouble comes up from page 12 onward: the script doesn't work anymore. I tried with different pages like 20, 50, 100 and 150. None worked.

What I've tried:

import time
import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'

params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}

with requests.Session() as s:
    page = 11
    while True:
        print("**"*5,"trying with page:",page)
        req = s.get(res_url,params=params)
        soup = BeautifulSoup(req.text,"lxml")
        if page==1:
            for item_link in soup.select("h4 > a.colorBlue[href]"):
                print(item_link.get("href"))

        else:
            # Collect the hidden WebForms fields (__VIEWSTATE etc.) and add
            # the paging postback values for the requested page
            payload = {i['name']:i.get('value') for i in soup.select('input[name]')}
            payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gv_Results'
            payload['__EVENTARGUMENT'] = f"Page${page}"
            payload['ctl00$ContentPlaceHolder1$ddl_SortValue'] = 'SiteName'

        res = s.post(res_url,params=params,data=payload)
        sauce = BeautifulSoup(res.text,"lxml")
        if not sauce.select("h4 > a.colorBlue[href]"):break
        for elem_link in sauce.select("h4 > a.colorBlue[href]"):
            print(elem_link.get("href"))

        page+=1
        time.sleep(3)

How can I scrape the links after page 11 using requests?

Solution

I think your scraping logic is correct, but in your loop you are doing a GET plus a POST on every iteration, whereas you should do a GET for the first page only and then issue a POST for each following iteration (if you want 1 iteration = 1 page).

An example:

import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'

params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}

max_page = 20

def extract(page, soup):
    # Print every result link found on the current page
    for item_link in soup.select("h4 a.colorBlue"):
        print("for page {} - {}".format(page, item_link.get("href")))

def build_payload(page, soup):
    # Re-submit the hidden WebForms fields (__VIEWSTATE etc.) together with
    # the paging postback values for the requested page. Only inputs that
    # actually have a name are sent, and a missing value defaults to "".
    payload = {}
    for input_item in soup.select("input[name]"):
        payload[input_item["name"]] = input_item.get("value", "")
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
    payload["__EVENTARGUMENT"] = "Page${}".format(page)
    payload["ctl00$ContentPlaceHolder1$ddl_SortValue"] = "SiteName"
    return payload

with requests.Session() as s:
    for page in range(1, max_page + 1):  # inclusive, so pages 1..max_page
        if page > 1:
            # Later pages: trigger the grid's paging postback
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        else:
            # First page: a plain GET is enough
            req = s.get(res_url, params=params)
        soup = BeautifulSoup(req.text, "lxml")
        extract(page, soup)
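
If the total number of pages isn't known up front, the fixed max_page can be swapped for the open-ended loop from the original attempt: stop as soon as a page yields no result links. A minimal sketch, reusing build_payload, the selectors, and the constants defined above (plus import time):

import time

with requests.Session() as s:
    page = 1
    soup = None
    while True:
        if page == 1:
            # First page: plain GET
            req = s.get(res_url, params=params)
        else:
            # Later pages: WebForms paging postback built from the previous page
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        soup = BeautifulSoup(req.text, "lxml")
        links = soup.select("h4 > a.colorBlue[href]")
        if not links:
            # No results on this page -> we went past the last page
            break
        for item_link in links:
            print("for page {} - {}".format(page, item_link.get("href")))
        page += 1
        time.sleep(3)  # be polite between requests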
