Is Puppeteer-Cluster Stealthy enough to pass bot tests?

Question

I wanted to know if anyone using puppeteer-cluster could elaborate on how Cluster.launch({settings}) protects against sharing cookies and web data between pages in different contexts.

Do the browser contexts here actually prevent cookies and user data from being shared or tracked? Browserless' now infamous page seems to think not, and says that .launch({}) should be called in the task, not ahead of the queue.

So my question is: how do we know whether puppeteer-cluster is sharing cookies or data between queued tasks? And what options does the library offer to lower the chances of being labeled a bot?

Setup: I am using page.authenticate with a proxy service and a random user agent, and I am still occasionally getting blocked (403) by the site I am testing against.

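const { Cluster } = require("puppeteer-cluster");

// Placeholders: these values were not shown in the original post.
const executablePath = process.env.CHROME_PATH; // undefined falls back to the default browser
const proxy_user = process.env.PROXY_USER;
const proxy_pass = process.env.PROXY_PASS;
const headers = { "Accept-Language": "en-US,en;q=0.9" };
const urls = ["https://example.com/"];

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Task: runs once for every queued job
const extract = async ({ page, data: dataJson }) => {
  // setExtraHTTPHeaders takes the header map itself, not { headers }
  await page.setExtraHTTPHeaders(headers);

  await page.authenticate({
    username: proxy_user,
    password: proxy_pass
  });

  // Randomized delay
  await delay(2000 + (Math.floor(Math.random() * 998) + 1));

  const response = await page.goto(dataJson.url);
  return response ? response.status() : null;
};

async function run() {
  // Create a cluster with 2 workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER, //Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 2, //5, //25, //the number of chromes open
    monitor: false, //true,
    // These are cluster options, so they belong here and not in puppeteerOptions
    sameDomainDelay: 1000,
    retryDelay: 3000,
    workerCreationDelay: 3000,
    puppeteerOptions: {
      executablePath,
      args: [
        "--proxy-server=pro.proxy.net:2222",
        "--incognito",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--disable-setuid-sandbox",
        "--no-first-run",
        "--no-sandbox",
        "--no-zygote"
      ],
      headless: false
    }
  });

  // Define the task
  await cluster.task(extract);

  // Loop over inputs and queue them into the cluster
  for (const url of urls) {
    cluster.queue({ url });
  }

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
}

run().catch(console.error);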

Recommended answer

Direct answer

Author of puppeteer-cluster here. The library does not actively block cookies, but makes use of browser.createIncognitoBrowserContext():

Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.

In addition, the docs state that "Incognito browser contexts don't write any browsing data to disk" (source), so restarting the browser cannot reuse any cookies from disk, as no data was ever written.

Regarding the library, this means that when a job is executed, a new incognito context is created, which does not share any data (cookies, etc.) with other contexts. So as long as Chromium properly implements incognito browser contexts, no data is shared between the jobs.
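If you would rather verify this than take it on trust, one option is a small probe. The following is a minimal sketch, not part of the library: each job records the cookies it sees on arrival and then sets a marker cookie, so a later job that already sees the marker points to shared state. The names PROBE, jobsThatSawProbe and checkCookieIsolation are made up for this example, and it assumes a Puppeteer version in which page.cookies() and page.setCookie() are available.

```javascript
// Hypothetical marker cookie used only by this check.
const PROBE = "isolation_probe";

// Pure helper: given the cookies each job saw on arrival, return the
// indices of the jobs that already saw the marker set by an earlier job.
function jobsThatSawProbe(cookiesOnArrival, probeName = PROBE) {
  return cookiesOnArrival
    .map((cookies, job) => (cookies.some((c) => c.name === probeName) ? job : -1))
    .filter((job) => job !== -1);
}

// Runs three jobs one after another against `url` and reports which of
// them inherited the marker cookie. An empty array means no leak was seen.
async function checkCookieIsolation(url, concurrency) {
  // Required lazily so the helper above can be used without the package.
  const { Cluster } = require("puppeteer-cluster");

  const cluster = await Cluster.launch({
    concurrency: concurrency ?? Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 1 // sequential jobs keep the order deterministic
  });

  await cluster.task(async ({ page, data }) => {
    await page.goto(data.url);
    const onArrival = await page.cookies(); // cookies visible before we add ours
    await page.setCookie({ name: PROBE, value: String(data.job), url: data.url });
    return onArrival;
  });

  const cookiesOnArrival = [];
  for (let job = 0; job < 3; job++) {
    // execute() resolves with the task's return value
    cookiesOnArrival.push(await cluster.execute({ url, job }));
  }

  await cluster.idle();
  await cluster.close();
  return jobsThatSawProbe(cookiesOnArrival);
}
```

Calling checkCookieIsolation("https://example.com/") should resolve to an empty array if the contexts are isolated. Passing Cluster.CONCURRENCY_PAGE as the second argument is a useful contrast, since the README describes that mode as sharing cookies between jobs, so later jobs would be expected to show up in the result.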

The page you linked only talks about browser.newPage() (which shares cookies between pages), not about incognito contexts.

Some websites will still block you, because they use different measures to detect bots. There are headless browser detection tests as well as fingerprinting libraries that might report you as a bot if the user agent does not match the browser fingerprint. You might be interested in this answer by me, which explains in more detail how these fingerprints work.
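To make the user agent mismatch concrete, here is a rough sketch of the kind of consistency check a detection script could run. It is an illustration only, not how any particular fingerprinting library works: the function names are invented, and real detectors look at many more signals than these three.

```javascript
// Map the operating system claimed in a user agent string to the prefix
// that navigator.platform usually has on that system.
function expectedPlatformPrefix(userAgent) {
  if (/Windows/.test(userAgent)) return "Win";
  if (/Macintosh/.test(userAgent)) return "Mac";
  if (/Linux|X11/.test(userAgent)) return "Linux";
  return null; // unknown OS: no expectation
}

// Pure helper: list the inconsistencies found in a set of collected signals.
function findFingerprintMismatches({ userAgent, platform, webdriver }) {
  const problems = [];
  if (webdriver) problems.push("navigator.webdriver is true");
  if (/HeadlessChrome/.test(userAgent)) {
    problems.push("user agent mentions HeadlessChrome");
  }
  const prefix = expectedPlatformPrefix(userAgent);
  if (prefix && !platform.startsWith(prefix)) {
    problems.push(`user agent implies ${prefix} but navigator.platform is ${platform}`);
  }
  return problems;
}

// Collect the signals from a Puppeteer page (for example inside a cluster task).
async function collectSignals(page) {
  return page.evaluate(() => ({
    userAgent: navigator.userAgent,
    platform: navigator.platform,
    webdriver: navigator.webdriver === true
  }));
}
```

Inside a task, findFingerprintMismatches(await collectSignals(page)) gives a quick self-check. A randomly chosen Windows user agent on a Linux machine, for instance, would be flagged here, which is one plausible reason a random user agent can make detection more likely rather than less.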

You can try a library like puppeteer-extra, which comes with a stealth plugin that may help with the problem. However, this is basically a cat-and-mouse game. The fingerprinting tests might change, or other sites might use a different "detection" mechanism. All in all, there is no way to guarantee that a website will not detect you.

In case you want to use puppeteer-extra, be aware that you can use it in conjunction with puppeteer-cluster (example code).
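The combination can be sketched roughly as follows. puppeteer-cluster accepts a puppeteer option for passing in a different Puppeteer implementation, and puppeteer-extra registers plugins through use(). The helper names buildClusterOptions and launchStealthCluster are made up for this example, and it assumes puppeteer-extra and puppeteer-extra-plugin-stealth are installed alongside puppeteer-cluster.

```javascript
// Pure helper: assemble the options object handed to Cluster.launch().
function buildClusterOptions(puppeteerImpl, concurrency, maxConcurrency = 2) {
  return {
    puppeteer: puppeteerImpl, // the Puppeteer implementation the cluster should use
    concurrency,
    maxConcurrency
  };
}

// Launch a cluster whose browsers are started through puppeteer-extra
// with the stealth plugin enabled.
async function launchStealthCluster() {
  // Required lazily so the helper above can be used without the packages.
  const { Cluster } = require("puppeteer-cluster");
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");

  puppeteer.use(StealthPlugin());

  return Cluster.launch(
    buildClusterOptions(puppeteer, Cluster.CONCURRENCY_CONTEXT)
  );
}
```

After const cluster = await launchStealthCluster(), tasks are defined and queued exactly as before. The stealth plugin only changes how the browser presents itself, so the caveat above still applies: it lowers the odds of detection but cannot rule it out.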
