Failed to load resource: the server responded with a status of 429 (Too Many Requests) and 404 (Not Found) with ChromeDriver Chrome through Selenium


Problem description

I am trying to build a scraper using Selenium in Python. The Selenium WebDriver opens a window and starts loading the page, but the loading suddenly stops. I can access the same link in my local Chrome browser.

Here are the error logs I'm getting from the webdriver:

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637}

{'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338}

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1556997748339}
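Entries like these can be filtered programmatically. The sketch below assumes the entries are dictionaries shaped like the ones above (for example, as returned by Selenium's `driver.get_log('browser')` with Chrome, which is presumably how they were collected). The helper name `failed_statuses` and the shortened `example.com` URLs are hypothetical, used only for illustration.

```python
import re

# Matches Chrome's "Failed to load resource" message when it carries an HTTP status.
STATUS_RE = re.compile(
    r"^(?P<url>\S+) - Failed to load resource: "
    r"the server responded with a status of (?P<status>\d{3})"
)


def failed_statuses(entries):
    """Return (url, status) pairs for log entries that report an HTTP status."""
    results = []
    for entry in entries:
        match = STATUS_RE.match(entry.get("message", ""))
        if match:
            results.append((match.group("url"), int(match.group("status"))))
    return results


# Hypothetical entries with the same shape as the log above (URLs shortened).
sample = [
    {"level": "SEVERE", "source": "network",
     "message": "https://example.com/page?pageNumber=1 - Failed to load resource: "
                "the server responded with a status of 429 (Too Many Requests)"},
    {"level": "SEVERE", "source": "network",
     "message": "about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME"},
    {"level": "SEVERE", "source": "network",
     "message": "https://example.com/fingerprint - Failed to load resource: "
                "the server responded with a status of 404 (Not Found)"},
]

for url, status in failed_statuses(sample):
    print(status, url)
```

The `about:blank` entry is skipped because it reports a network error rather than an HTTP status.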

My script:

from selenium import webdriver
import os

path = os.path.join(os.getcwd(), 'chromedriver')
driver = webdriver.Chrome(executable_path=path)

links = [
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/baby-accessories?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/food?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/formula?pageNumber=1",
]


for link in links:
    driver.get(link)

Solution

429 Too Many Requests

The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.

When a server is under attack or just receiving a very large number of requests from a single party, responding to each with a 429 status code will consume resources. Therefore, servers are not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections, or take other steps.
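As a rough illustration of the Retry-After header mentioned above: its value is either a number of seconds or an HTTP date. The following is a minimal sketch of turning such a value into a delay. The function name `retry_after_seconds` and the 30-second fallback are assumptions made for this example, and how the header value is obtained is left out, since that depends on the HTTP client in use.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=30.0):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    `default` is an arbitrary fallback (an assumption of this sketch) used when
    the header is missing or cannot be parsed.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

A caller would typically sleep for the returned number of seconds before retrying. As the paragraph above notes, a server is not required to send a 429 or this header at all, which is why the sketch needs a fallback.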


404 Not Found

The HTTP 404 Not Found client error response code indicates that the server cannot find the requested resource. In the browser, this means the URL is not recognized. In an API, this can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 to hide the existence of a resource from an unauthorized client. This response code is probably the most famous one due to its frequent occurrence on the web.

A 404 status code does not indicate whether the resource is temporarily or permanently missing. But if a resource is permanently removed, a 410 (Gone) should be used instead of a 404 status. Additionally, the 404 status code is used when the requested resource is not found, whether it doesn't exist or there was a 401 or 403 that, for security reasons, the service wants to mask.


Analysis

When I tried your code block, I saw the same behavior. If you inspect the DOM tree of the webpage, you will find that quite a few tags contain the keyword dist. For example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
  • 'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

The presence of the term dist suggests that the website is protected by the bot management service provider Distil Networks, and that the navigation by ChromeDriver gets detected and subsequently blocked.
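The manual DOM inspection described above can be automated. The following is a minimal sketch, assuming the page markup is available as a string (with Selenium, `driver.page_source` provides it). The class name `PathSegmentFinder`, the helper `find_segment`, and the shortened sample markup are hypothetical. The sketch only lists `href` and `src` values that contain a given path segment and draws no conclusion from the matches.

```python
from html.parser import HTMLParser


class PathSegmentFinder(HTMLParser):
    """Collect href/src attribute values that contain a given path segment."""

    def __init__(self, segment):
        super().__init__()
        self.segment = segment
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Compare whole path segments so "distance" does not match "dist".
            if name in ("href", "src") and value and self.segment in value.split("/"):
                self.matches.append((tag, name, value))


def find_segment(html, segment="dist"):
    """Return (tag, attribute, value) triples whose path contains `segment`."""
    finder = PathSegmentFinder(segment)
    finder.feed(html)
    return finder.matches


# Shortened, hypothetical markup modelled on the tags listed above.
sample_html = (
    '<link rel="shortcut icon" href="/wcsstore/Store/dist/abc123/img/favicon.ico">'
    '<script src="/js/distance.js"></script>'
)

for tag, attr, value in find_segment(sample_html):
    print(tag, attr, value)
```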


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium as the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".

