Selenium and non-headless browser keeps asking for Captcha

Problem description

One of our sites kept asking for a captcha when run in headless mode in a browser in the cloud. I switched to non-headless mode so that I could enter the captcha myself, expecting later runs to work, perhaps because some cookies would already be stored. They did not, even though I entered the captcha several times.

It is also worth mentioning that the site runs fine locally in either mode, and it also runs fine in the cloud when the browser is not automated. As soon as I run it there with Selenium, in either mode, it keeps asking for the captcha. Any ideas about what might be happening, and about possible solutions, are greatly appreciated.

Recommended answer

In the discussion entitled "How does recaptcha 3 know I'm using selenium/chromedriver" we discussed some generic approaches to avoid getting detected while web scraping. Let's take a deeper dive.

A headless browser is a browser that can be used without a graphical interface. It can be controlled programmatically to automate tasks, such as running tests or taking screenshots of webpages.

According to @AntoineVastel, headless browsers are also used to automate malicious tasks. The most common cases are web scraping, inflating advertisement impressions, and looking for vulnerabilities on a website.

Until a year ago, one of the most popular headless browsers was PhantomJS. Because it is built on the Qt framework, it exhibits many differences compared to most popular browsers, and it could be detected using some browser fingerprinting techniques. With version 59, Google released a headless version of its Chrome browser. Unlike PhantomJS, it is based on vanilla Chrome rather than on an external framework, which makes its presence more difficult to detect. Even so, there are several ways to detect Chrome headless:

  • User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With Chrome version 59 in headless mode it has the following value:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36

    • The presence of Chrome headless can be checked with:

    if (/HeadlessChrome/.test(window.navigator.userAgent)) {
        console.log("Chrome headless detected");
    }
    

  • Plugins: navigator.plugins returns an array of the plugins present in the browser. Typically, on Chrome we find default plugins such as Chrome PDF viewer or Google Native Client. In headless mode, by contrast, the returned array contains no plugins.

    • The presence of plugins can be checked with:

    if(navigator.plugins.length == 0) {
        console.log("It may be Chrome headless");
    }
    

  • Languages: In Chrome, two JavaScript attributes expose the languages used by the user: navigator.language and navigator.languages. The first is the language of the browser UI, while the second is an array of strings representing the user's preferred languages. However, in headless mode, navigator.languages returns an empty string.

    • The presence of languages can be checked with:

    // Covers an empty string, an empty array, and a missing property.
    if (!navigator.languages || navigator.languages.length === 0) {
        console.log("Chrome headless detected");
    }
    

  • WebGL: WebGL is an API for 3D rendering in an HTML canvas. With this API, it is possible to query the vendor and the renderer of the graphics driver. With vanilla Chrome on Linux, we obtain the following values for renderer and vendor: Google SwiftShader and Google Inc. In headless mode, we obtain Mesa OffScreen, the technology used for rendering without any sort of window system, and Brian Paul, the developer who started the open source Mesa graphics library.

    • The WebGL vendor and renderer can be checked with:

    var canvas = document.createElement('canvas');
    var gl = canvas.getContext('webgl');
    // WebGL or the debug extension may be unavailable, so guard against null.
    var debugInfo = gl && gl.getExtension('WEBGL_debug_renderer_info');

    if (debugInfo) {
        var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
        var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

        if (vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
            console.log("Chrome headless detected");
        }
    }
    

  • Not all headless Chrome builds report the same values for vendor and renderer. Some keep values that can also be found in the non-headless version. However, Mesa OffScreen and Brian Paul indicate the presence of the headless version.

  • Browser features: The Modernizr library makes it possible to test whether a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.

    • The presence of the hairline feature can be checked with:

    if(!Modernizr["hairline"]) {
        console.log("It may be Chrome headless");
    }
    

  • Missing image: The last check on our list, which also seems to be the most robust, relies on the dimensions of the placeholder Chrome uses when an image cannot be loaded. In vanilla Chrome, the image has a width and height that depend on the zoom level of the browser but are never zero. In headless Chrome, the image has a width and a height equal to zero.

    • The dimensions of a missing image can be checked with:

    var body = document.getElementsByTagName("body")[0];
    var image = document.createElement("img");
    image.setAttribute("id", "fakeimage");
    // Attach the handler before setting src so the error event is not missed.
    image.onerror = function () {
        if (image.width == 0 && image.height == 0) {
            console.log("Chrome headless detected");
        }
    };
    // Deliberately unresolvable URL, so loading always fails.
    image.src = "http://iloveponeydotcom32188.jg";
    body.appendChild(image);
    

These are some of the key reasons why headless browsers are more likely to be detected.
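To see how a site might weigh these checks together, the individual tests can be folded into one function. The following is only a sketch: the function name `collectHeadlessSignals`, the shape of the `env` object, and the signal labels are all hypothetical and do not come from the original answer or from any library. It takes plain values so that it can be run outside a browser.

```javascript
// Hypothetical helper that combines the checks described in this answer.
// `env` is a plain object filled in by the caller, not a browser API.
function collectHeadlessSignals(env) {
    var signals = [];

    // User agent carries the HeadlessChrome token.
    if (/HeadlessChrome/.test(env.userAgent || "")) {
        signals.push("user-agent");
    }
    // No plugins are reported.
    if (env.pluginCount === 0) {
        signals.push("plugins");
    }
    // navigator.languages is missing, an empty string, or an empty array.
    if (!env.languages || env.languages.length === 0) {
        signals.push("languages");
    }
    // WebGL reports the Mesa vendor/renderer pair.
    if (env.webglVendor === "Brian Paul" && env.webglRenderer === "Mesa OffScreen") {
        signals.push("webgl");
    }
    // A broken image is rendered with zero width and height.
    if (env.brokenImageWidth === 0 && env.brokenImageHeight === 0) {
        signals.push("missing-image");
    }
    return signals;
}
```

In a page, the caller would presumably fill `env` from `navigator.userAgent`, `navigator.plugins.length`, `navigator.languages`, the WebGL debug extension, and the dimensions measured in the image's `onerror` handler. Returning a list of matched signals rather than a single boolean reflects the point made above that some checks only suggest headless Chrome ("It may be Chrome headless") while others are stronger evidence.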
