Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot


Problem Description


I am trying to scrape https://www.controller.com/ with Python. Since the page detected a bot when using pandas.get_html, and when using requests with user-agents and a rotating proxy, I resorted to using Selenium WebDriver. However, this is also being detected as a bot, with the following message. Can anybody explain how I can get past this?


Pardon Our Interruption... As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen: You're a power user moving through this website with super-human speed. You've disabled JavaScript in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article. To request an unblock, please fill out the form below and we will review it as soon as possible.
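For context, the requests-based attempt the question alludes to (a browser-like User-Agent; the proxy rotation is omitted here) might have looked something like the sketch below. It would still be blocked, since Distil fingerprints far more than the User-Agent string:

```python
import requests

# Sketch of the requests approach described in the question:
# a browser-like User-Agent header (proxy rotation omitted).
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/74.0.3729.169 Safari/537.36"
    )
})

# session.get("https://www.controller.com/") would still return the
# "Pardon Our Interruption" page, since the block is behavioral,
# not based on the User-Agent alone.
print(session.headers["User-Agent"])
```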

Here is my code:

from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)

Recommended Answer


You have mentioned pandas.get_html only in your question and options.add_argument('headless') only in your code, so it is not clear whether you are actually using them. However, taking the minimal code from your attempt, as follows:


  • Code Block:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.controller.com/')
print(driver.title)

I ran into the same issue.

  • Browser Snapshot:


When I inspected the HTML DOM, I observed that the website references the distil_referrer on window.onbeforeunload as follows:

<script type="text/javascript" id="">
    window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>
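The effect of that handler can be modelled in a few lines of plain Python, with a dict standing in for the browser's sessionStorage (purely illustrative; this is not Distil's actual implementation):

```python
# Plain-Python model of the onbeforeunload handler above;
# a dict stands in for the browser's sessionStorage.
session_storage = {"distil_referrer": "https://www.controller.com/"}

def on_before_unload(storage):
    # Mirrors: sessionStorage.removeItem("distil_referrer")
    storage.pop("distil_referrer", None)

on_before_unload(session_storage)
print(session_storage)  # → {}
```

In other words, the tracking key is cleared every time the page unloads, so it only ever exists during an active browsing session.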

Snapshot:


This is a clear indication that the website is protected by the Bot Management service provider Distil Networks, and the navigation by ChromeDriver gets detected and subsequently blocked.

As per the article:


Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,


"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium as a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".




Reference

You can find a couple of detailed discussions in:

  • Distil detects WebDriver driven Chrome Browsing Context
  • Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
  • Akamai Bot Manager detects WebDriver driven Chrome Browsing Context

