Bypassing script response when scraping website with Requests/BeautifulSoup

Question

I am scraping www.marriott.com for information on their hotels and prices. I used the Chrome inspect tool to monitor network traffic and figure out which API endpoint Marriott is using.

This is the request I am trying to emulate:

http://www.marriott.com/reservation/availabilitySearch.mi?propertyCode=TYSMC&isSearch=true&fromDate=02/23/17&toDate=02/24/17&numberOfRooms=1&numberOfGuests=1&numberOfChildren=0&numberOfAdults=1

With my Python code:

import requests
from bs4 import BeautifulSoup

base_uri = 'https://www.marriott.com'
availability_search_ext = '/reservation/availabilitySearch.mi'

rate_params = {
    'propertyCode': 'TYSMC',
    'isSearch': 'true',
    'fromDate': '03/01/17',
    'toDate': '03/02/17',
    'numberOfRooms': '1',
    'numberOfGuests': '1',
    'numberOfChildren': '0',
    'numberOfAdults': '1'
}

def get_rates(sess):
    first_resp = sess.get(base_uri + availability_search_ext, params=rate_params)
    soup = BeautifulSoup(first_resp.content, 'html.parser')
    print(soup.title)

if __name__ == "__main__":
    with requests.Session() as sess:
        #get_hotels(sess)
        get_rates(sess)
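
As a quick sanity check, you can prepare the request without sending it and compare the URL that requests builds against the one captured in DevTools (a minimal sketch; the dates may show up percent-encoded, which the server decodes the same way):

from requests import Request

# Build the request without sending it, to inspect the final URL.
prepared = Request('GET', base_uri + availability_search_ext,
                   params=rate_params).prepare()
print(prepared.url)  # compare against the availabilitySearch.mi URL above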

However, I get this result:

<!DOCTYPE doctype html>

<html>
<head><script src="/common/js/marriottCommon.js" type="text/javascript"> </script>
<meta charset="utf-8">
</meta></head>
<body>
<script>
        var xhttp = new XMLHttpRequest();
        xhttp.addEventListener("load", function(a,b,c){
          window.location.reload()
        });
        xhttp.open('GET', '/reservation/availabilitySearch.mi?istl_enable=true&istl_data', true);
        xhttp.send();
      </script>
</body>
</html>

It seems they are trying to prevent bots from scraping their data, so they send back a script that reloads the page, sends an XHR request, and then hits this endpoint http://www.marriott.com/reservation/rateListMenu.mi to render the webpage.

So I tried emulating the behavior of the returned JavaScript by changing my Python code to this:

rate_list_ext = '/reservation/rateListMenu.mi'
xhr_params = {
    'istl_enable': 'true',
    'istl_data': ''
}

def get_rates(sess):
    # Step 1: the original availability search
    first_resp = sess.get(base_uri + availability_search_ext,
                          params=rate_params)
    # Step 2: the XHR that the returned script sends
    rate_xhr_resp = sess.get(base_uri + availability_search_ext,
                             params=xhr_params)
    # Step 3: the page the browser loads after the reload
    rate_list_resp = sess.get(base_uri + rate_list_ext)
    soup = BeautifulSoup(rate_list_resp.content, 'html.parser')

I am making the initial GET request with all the parameters, then the XHR request that the script makes, and finally a request to the rateListMenu.mi endpoint to try to get the final HTML page, but I get a session-timed-out response.

I even made a persistent session with the requests library to store any cookies that the website returns, after reading: Different web site response with RoboBrowser
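
To confirm the session is actually holding cookies, one minimal debugging sketch is to dump the cookie jar after the first request; if it comes back empty, the site never issued session cookies to this client, which would explain the timed-out page:

first_resp = sess.get(base_uri + availability_search_ext, params=rate_params)
# Show the cookies the Session has accumulated so far, as a plain dict.
print(sess.cookies.get_dict())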

What am I doing wrong?

Answer

When the JavaScript makes those GET requests, it includes a set of browser headers. If you include enough of these headers in your own requests, you should get similar responses.

Example:

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36"}

sess.get(base_uri + availability_search_ext, params=rate_params, headers=headers)
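
Beyond the User-Agent, it can help to copy the rest of the headers the browser sends, which you can read off the same DevTools Network capture. A sketch with typical browser values (the values below are illustrative assumptions, not values confirmed against marriott.com):

headers = {
    # Real browser identity string, as in the answer above.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36',
    # Illustrative values; copy the real ones from your DevTools capture.
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Referer': base_uri + availability_search_ext,
}

sess.get(base_uri + availability_search_ext, params=rate_params, headers=headers)

The same headers dict can be passed to the XHR and rateListMenu.mi requests as well, since the browser sends them on every request.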
