Selenium and non-headless browser keeps asking for Captcha

Problem description

I ran into an issue where one of our sites kept asking for a captcha when loaded in headless mode in a browser in the cloud. I switched to non-headless mode so I could enter the captcha myself, expecting that subsequent runs would work, perhaps because some cookies would already be stored. They did not, even though I entered the captcha several times.

It is also worth mentioning that the site runs fine locally in either mode, and it also runs well in the cloud when it is not automated. But as soon as I run it there with Selenium, in either mode, it keeps asking for the captcha. Any ideas about what might be happening, and how to solve it, are greatly appreciated.

Recommended answer

The discussion entitled "How does recaptcha 3 know I'm using selenium/chromedriver" covers some generic approaches to avoid getting detected while web scraping. Let's dive deeper.

A headless browser is a browser that can be used without a graphical interface. It can be controlled programmatically to automate tasks, such as running tests or taking screenshots of web pages.

According to @AntoineVastel, headless browsers can also be used to automate malicious tasks. The most common cases are scraping websites, inflating advertisement impressions, and looking for vulnerabilities on a website.

Until a year ago, one of the most popular headless browsers was PhantomJS. Because it is built on the Qt framework, it differs in many ways from most popular browsers, and it could be detected with some browser fingerprinting techniques. Starting with version 59, Google released a headless version of its Chrome browser. Unlike PhantomJS, it is based on vanilla Chrome rather than on an external framework, which makes its presence harder to detect. Even so, there are still ways to detect headless Chrome:

  • User agent: The user agent attribute is commonly used to detect the user's OS and browser. With headless Chrome version 59, it has the following value:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36

    • A check for the presence of headless Chrome can be done through:

      if (/HeadlessChrome/.test(window.navigator.userAgent)) {
          console.log("Chrome headless detected");
      }
      

  • Plugins: navigator.plugins returns an array of the plugins present in the browser. Typically, Chrome exposes default plugins such as Chrome PDF viewer or Google Native Client. In headless mode, by contrast, the returned array contains no plugins.

    • A check for the presence of plugins can be done through:

      if(navigator.plugins.length == 0) {
          console.log("It may be Chrome headless");
      }
      

  • Languages: In Chrome, two JavaScript attributes expose the languages used by the user: navigator.language and navigator.languages. The first is the language of the browser UI, while the second is an array of strings representing the user's preferred languages. In headless mode, however, navigator.languages returns an empty string.

    • A check for the presence of languages can be done through:

      if(navigator.languages == "") {
           console.log("Chrome headless detected");
      }
      

  • WebGL: WebGL is an API for 3D rendering in an HTML canvas. With this API, it is possible to query the vendor and the renderer of the graphics driver. With vanilla Chrome on Linux, the renderer and vendor are Google SwiftShader and Google Inc. In headless mode, they are Mesa OffScreen, the technology used for rendering without any window system, and Brian Paul, the developer who started the open source Mesa graphics library.

    • A check for the WebGL vendor and renderer can be done through:

      var canvas = document.createElement('canvas');
      var gl = canvas.getContext('webgl');

      // getContext() returns null without WebGL support, and the debug extension may be unavailable.
      var debugInfo = gl && gl.getExtension('WEBGL_debug_renderer_info');
      if (debugInfo) {
          var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
          var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

          if (vendor === "Brian Paul" && renderer === "Mesa OffScreen") {
              console.log("Chrome headless detected");
          }
      }
      

    • Not all headless Chrome builds report the same vendor and renderer values; some keep values that can also be found in non-headless versions. However, Mesa OffScreen and Brian Paul indicate the presence of the headless version.

  • Browser features: The Modernizr library can test whether a wide range of HTML and CSS features are present in a browser. The only difference found between Chrome and headless Chrome was that the latter lacked the hairline feature, which detects support for hidpi/retina hairlines.

    • A check for the presence of the hairline feature can be done through:

      if(!Modernizr["hairline"]) {
          console.log("It may be Chrome headless");
      }
      

  • Missing image: The last check on the list, which also seems to be the most robust, relies on the dimensions of the placeholder Chrome uses when an image cannot be loaded. In vanilla Chrome, the placeholder has a width and height that depend on the browser zoom but are nonzero. In headless Chrome, the image has a width and height equal to zero.

    • A check for the missing image dimensions can be done through:

      var body = document.getElementsByTagName("body")[0];
      var image = document.createElement("img");
      image.setAttribute("id", "fakeimage");
      // Attach the handler before setting src so the error event cannot be missed.
      image.onerror = function () {
          if (image.width === 0 && image.height === 0) {
              console.log("Chrome headless detected");
          }
      };
      body.appendChild(image);
      // Deliberately unresolvable URL, so loading always fails.
      image.src = "http://iloveponeydotcom32188.jg";
      

These are some of the key reasons why headless browsers are more likely to be detected.
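As a rough illustration, the individual checks above can be folded into a single function that reports which signals fired. This is only a sketch: headlessSignals and the env object with its field names are hypothetical, chosen so that the logic can be exercised outside a browser. The signals reflect the Chrome 59-era behavior described above, not what any particular captcha service actually checks.

```javascript
// Sketch: combine the headless signals described above into one report.
// `env` is a hypothetical plain object; in a page it would be filled from
// navigator.userAgent, navigator.plugins.length, navigator.languages, and
// the WebGL vendor/renderer strings.
function headlessSignals(env) {
  const signals = [];
  if (/HeadlessChrome/.test(env.userAgent || "")) {
    signals.push("userAgent");
  }
  if (env.pluginCount === 0) {
    signals.push("plugins");
  }
  // Covers a missing value, an empty string, and an empty array.
  if (!env.languages || env.languages.length === 0) {
    signals.push("languages");
  }
  if (env.webglVendor === "Brian Paul" && env.webglRenderer === "Mesa OffScreen") {
    signals.push("webgl");
  }
  return signals;
}

// Sample input built from the headless user agent string quoted earlier.
const sample = {
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36",
  pluginCount: 0,
  languages: "",
};
console.log(headlessSignals(sample).join(", ")); // userAgent, plugins, languages
```

Returning a list of signal names rather than a single boolean keeps the weak hints (such as an empty plugin list) distinguishable from the strong ones (such as the user agent string).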

      • Detecting PhantomJS Based Visitors
      • Unable to use Selenium to automate Chase site login
      • Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
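The checks listed earlier target headless mode, while the question reports captchas in non-headless mode as well. One signal that does not depend on headless mode is the navigator.webdriver flag named in the last link: the WebDriver specification defines it as true while the browser is under automation. A minimal sketch of a page-side check follows; looksAutomated is a hypothetical helper name, and whether a given captcha service relies on this flag is an assumption, not something confirmed here.

```javascript
// Sketch: report whether a navigator-like object carries the WebDriver flag.
// `nav` is passed in so the function can also run outside a browser.
function looksAutomated(nav) {
  return Boolean(nav && nav.webdriver === true);
}

// In a page, this would be called as looksAutomated(window.navigator).
console.log(looksAutomated({ webdriver: true }));  // true
console.log(looksAutomated({ webdriver: false })); // false
console.log(looksAutomated({}));                   // false
```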
