网页正在使用 Chromedriver 作为机器人检测 Selenium Webdriver [英] Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot

查看:104
本文介绍了网页正在使用 Chromedriver 作为机器人检测 Selenium Webdriver的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 python 抓取

当我检查

这清楚地表明该网站受到机器人管理服务提供商的保护Distil NetworksChromeDriver 的导航被检测到并随后被阻止.

<小时>

蒸馏

根据文章确实有一些关于 Distil.it 的东西......:

<块引用>

Distil 通过观察站点行为和识别抓取工具特有的模式来保护站点免受自动内容抓取机器人的侵害.当 Distil 在一个站点上识别出恶意机器人时,它会创建一个列入黑名单的行为配置文件,并将其部署给所有客户.类似于机器人防火墙,Distil 会检测模式并做出反应.

进一步,

<块引用>

Selenium 的一个模式是自动窃取网络内容",Distil 首席执行官 Rami Essaid 上周在接受采访时说.即使他们可以创建新的机器人,我们还是找到了一种方法来识别 Selenium 是他们正在使用的工具,因此无论他们在该机器人上迭代多少次,我们都会阻止 Selenium.我们正在这样做现在使用 Python 和许多不同的技术.一旦我们看到一种模式从一种机器人中出现,我们就会对他们使用的技术进行逆向工程并将其识别为恶意".

<小时>

参考

您可以在以下位置找到一些详细的讨论:

I am trying to scrape https://www.controller.com/ with python, and since the page detected a bot using pandas.get_html, and requests using user-agents and a rotating proxy, i resorted to using selenium webdriver. However, this is also being detected as a bot with the following message. Can anybody explain how can I get past this?:

Pardon Our Interruption... As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen: You're a power user moving through this website with super-human speed. You've disabled JavaScript in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article. To request an unblock, please fill out the form below and we will review it as soon as possible"

Here is my code:

from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)

解决方案

You have mentioned about pandas.get_html only in your question and options.add_argument('headless') only in your code so not sure if you are implementing them. However taking out minimum code from your code attempt as follows:

  • Code Block:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:UtilityBrowserDriverschromedriver.exe')
    driver.get('https://www.controller.com/')
    print(driver.title)
    

I have faced the same issue.

  • Browser Snashot:

When I inspected the HTML DOM it was observed that the website refers the distil_referrer on window.onbeforeunload as follows:

<script type="text/javascript" id="">
    window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>

Snapshot:

This is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a couple of detailed discussion in:

这篇关于网页正在使用 Chromedriver 作为机器人检测 Selenium Webdriver的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆