Is Puppeteer-Cluster Stealthy enough to pass bot tests?

Question

I wanted to know if anyone using puppeteer-cluster could elaborate on how Cluster.launch({settings}) protects against sharing cookies and web data between pages in different contexts.

Do the browser contexts here actually keep cookies and user data from being shared or tracked? Browserless' now infamous page (here) seems to think not, and suggests that .launch({}) should be called in the task rather than ahead of the queue.

So my question is, how do we know if puppeteer-cluster is sharing cookies / data between queued tasks? And what kind of options are in the library to lower the chances of being labeled a bot?

Setup: I am using page.authenticate with a proxy service and a random user agent, and I am still occasionally blocked (403) by the site I'm testing against.

const { Cluster } = require("puppeteer-cluster");

// Assumed to be defined elsewhere, as in the original post:
// executablePath, headers, proxy_user, proxy_pass, url

// Resolve after `ms` milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function run() {
  // Create a cluster with 2 workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER, //Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 2, //5, //25, //the number of chromes open
    monitor: false, //true,
    // These delays are cluster options, not puppeteerOptions
    sameDomainDelay: 1000,
    retryDelay: 3000,
    workerCreationDelay: 3000,
    puppeteerOptions: {
      executablePath,
      args: [
        "--proxy-server=pro.proxy.net:2222",
        "--incognito",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--disable-setuid-sandbox",
        "--no-first-run",
        "--no-sandbox",
        "--no-zygote"
      ],
      headless: false
    }
  });

  // Task: authenticate against the proxy, wait a little, then load the page
  const extract = async ({ page, data: dataJson }) => {
    // `headers` is assumed to be a plain object of header names to values
    await page.setExtraHTTPHeaders(headers);

    await page.authenticate({
      username: proxy_user,
      password: proxy_pass
    });

    // Randomized delay
    await delay(2000 + (Math.floor(Math.random() * 998) + 1));

    const response = await page.goto(dataJson.url);
    return response;
  };

  // Register the task once; every queued item is passed to it as `data`
  await cluster.task(extract);

  // Loop over inputs, and queue them into the cluster
  const dataJson = { url: url };
  cluster.queue(dataJson);

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
}

Recommended answer

Direct answer

Author of puppeteer-cluster here. The library does not actively block cookies, but makes use of browser.createIncognitoBrowserContext():

Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.

In addition, the docs state that "Incognito browser contexts don't write any browsing data to disk" (source), so restarting the browser cannot reuse any cookies from disk, as no data was ever written.

Regarding the library, this means when a job is executed, a new incognito context is created, which does not share any data (cookies, etc.) with other contexts. So as long as Chromium properly implements the incognito browser contexts, there is no data shared between the jobs.

The page you linked only talks about browser.newPage() (which shares cookies between pages) and not about incognito contexts.
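One way to check the isolation for yourself is a small experiment: run two jobs one after the other, let each job set a marker cookie, and see whether the second job can read the marker left by the first. The following is only a minimal sketch, assuming `puppeteer` and `puppeteer-cluster` are installed; the helper names and the `job_marker` cookie are made up for this example. Note that recent Puppeteer releases rename `createIncognitoBrowserContext()` to `createBrowserContext()`, so the exact method name depends on your version.

```javascript
// Sketch: check whether a cookie set in one job is visible in the next one.
// `collectCookieNames`, `markerLeaked` and `job_marker` are hypothetical names.

// Run two jobs sequentially and record the cookie names each job sees
// before it sets its own marker cookie.
async function collectCookieNames(Cluster, concurrency, url) {
  const cluster = await Cluster.launch({ concurrency, maxConcurrency: 1 });
  const seen = [];

  await cluster.task(async ({ page, data }) => {
    await page.goto(data.url);
    const cookies = await page.cookies();
    seen.push(cookies.map((cookie) => cookie.name));
    // Leave a marker behind for the next job to (hopefully not) find
    await page.setCookie({
      name: "job_marker",
      value: String(data.id),
      url: data.url,
    });
  });

  cluster.queue({ id: 1, url });
  cluster.queue({ id: 2, url });
  await cluster.idle();
  await cluster.close();
  return seen;
}

// True if any job saw a marker cookie left behind by an earlier job.
function markerLeaked(seen) {
  return seen.some((names) => names.includes("job_marker"));
}

// Possible usage (not executed here):
// const { Cluster } = require("puppeteer-cluster");
// collectCookieNames(Cluster, Cluster.CONCURRENCY_CONTEXT, "https://example.com")
//   .then((seen) => console.log("marker leaked:", markerLeaked(seen)));
```

Running the same check with `Cluster.CONCURRENCY_PAGE` instead should make the difference visible, since pages opened in the same browser context share cookies.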

Some websites will still block you, because they use other measures to detect bots. There are headless browser detection tests as well as fingerprinting libraries that might report you as a bot if the user agent does not match the browser fingerprint. You might be interested in this answer of mine, which explains in more detail how these fingerprints work.
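To make the "user agent does not match the fingerprint" idea concrete, here is a deliberately simplified consistency check. It is not how any particular fingerprinting library works; the rules and function names are assumptions made up for this illustration, and real detectors look at many more signals. The point is only that overriding the user agent alone can leave other properties, such as `navigator.platform` or `navigator.webdriver`, telling a different story.

```javascript
// Simplified illustration of a user-agent consistency check.
// All function names and rules here are hypothetical.

// Guess the OS family advertised in a user-agent string.
function osFromUserAgent(userAgent) {
  if (/Windows NT/.test(userAgent)) return "windows";
  if (/Mac OS X/.test(userAgent)) return "mac";
  if (/Linux/.test(userAgent)) return "linux";
  return "unknown";
}

// Guess the OS family from the value of navigator.platform.
function osFromPlatform(platform) {
  if (/^Win/.test(platform)) return "windows";
  if (/^Mac/.test(platform)) return "mac";
  if (/^Linux/.test(platform)) return "linux";
  return "unknown";
}

// Flag a visitor if automation is advertised or the two OS guesses disagree.
function looksInconsistent({ userAgent, platform, webdriver }) {
  const uaOs = osFromUserAgent(userAgent);
  const platformOs = osFromPlatform(platform);
  const osMismatch =
    uaOs !== "unknown" && platformOs !== "unknown" && uaOs !== platformOs;
  return webdriver === true || osMismatch;
}

// Inside a cluster task, the same signals could be collected with:
// const signals = await page.evaluate(() => ({
//   userAgent: navigator.userAgent,
//   platform: navigator.platform,
//   webdriver: navigator.webdriver,
// }));
// console.log(looksInconsistent(signals));
```

A random Windows user agent on a Linux proxy host would trip a check like this, which may be one reason a randomized user agent still gets blocked occasionally.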

You can try a library like puppeteer-extra, which comes with a stealth plugin that may help with the problem. However, this is basically a cat-and-mouse game. The fingerprinting tests might change, or other sites might use a different "detection" mechanism. All in all, there is no way to guarantee that a website does not detect you.

In case you want to use puppeteer-extra, be aware that you can use it in conjunction with puppeteer-cluster (example code).
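As a rough sketch of that combination: puppeteer-cluster accepts a custom puppeteer object through the `puppeteer` launch option, so a puppeteer-extra instance with the stealth plugin registered can be passed in. The code below assumes `puppeteer`, `puppeteer-cluster`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are installed; `buildClusterOptions` and `runWithStealth` are hypothetical helper names, and the concurrency settings are arbitrary choices for the example.

```javascript
// Sketch: plug puppeteer-extra (with the stealth plugin) into puppeteer-cluster.

// Build the cluster options; `puppeteer` is the puppeteer-extra instance the
// cluster should use instead of the plain puppeteer package.
function buildClusterOptions(Cluster, puppeteer) {
  return {
    puppeteer,
    concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito context per job
    maxConcurrency: 2,
  };
}

// Visit each URL through the stealth-enabled cluster and collect status codes.
async function runWithStealth(urls) {
  // Required lazily so the helper above has no hard dependency on the packages
  const { Cluster } = require("puppeteer-cluster");
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  puppeteer.use(StealthPlugin());

  const cluster = await Cluster.launch(buildClusterOptions(Cluster, puppeteer));
  const results = [];

  await cluster.task(async ({ page, data: url }) => {
    const response = await page.goto(url);
    results.push({ url, status: response ? response.status() : null });
  });

  for (const url of urls) {
    cluster.queue(url);
  }

  await cluster.idle();
  await cluster.close();
  return results;
}

// Possible usage (not executed here):
// runWithStealth(["https://example.com"]).then(console.log);
```

Logging the status code per URL makes it easier to see whether the 403 responses become less frequent, though, as noted above, nothing here guarantees that a site will not detect the automation.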
