Why does headless need to be false for Puppeteer to work?

Problem description

I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console

and received an answer that suggested it would only work if headless was set to false. I don't want to be constantly opening up a browser UI I don't need (I just need the data!), so I'm looking for why headless has to be false, and whether there is a fix that lets me set headless = true.

Here's my code:

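const express = require("express");
const puppet = require("puppeteer");

// Assumption: PORT was defined elsewhere in the original app.
const PORT = process.env.PORT || 3000;

express()
  .get("/*", (req, res) => {
    global.notBaseURL = req.params[0];
    (async () => {
      const browser = await puppet.launch({ headless: false }); // Line of Interest
      const page = await browser.newPage();
      console.log(req.params[0]);
      await page.goto(req.params[0], { waitUntil: "networkidle2" }); // this is the URL
      // Declare title locally instead of leaking an implicit global.
      const title = await page.$eval("title", (el) => el.innerText);

      await browser.close(); // wait for the browser to shut down

      res.send({
        title: title,
      });
    })();
  })
  .listen(PORT, () => console.log(`Listening on ${PORT}`));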

This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK

Solution

The reason it might work in UI mode but not in headless mode is that sites that aggressively fight scraping can detect that you are running a headless browser.
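To make the detection idea concrete, here is a minimal sketch of two signals a site could check. It assumes default Puppeteer settings, where headless Chrome's user agent has typically contained the token HeadlessChrome and automated browsers expose navigator.webdriver as true. The helper names looksHeadless and collectFingerprint are hypothetical, and real anti-bot services look at many more signals than these two.

```javascript
// Hypothetical helper: classifies a fingerprint the way a very simple
// bot check might. Real anti-scraping services use far more signals.
function looksHeadless({ userAgent, webdriver }) {
  const hasHeadlessToken = /HeadlessChrome/i.test(userAgent || "");
  return hasHeadlessToken || webdriver === true;
}

// Hypothetical helper: reads the same two values from inside the page,
// so you can compare what a site sees in headless and headful runs.
// `page` is a Puppeteer Page object.
async function collectFingerprint(page) {
  return page.evaluate(() => ({
    userAgent: navigator.userAgent,
    webdriver: navigator.webdriver,
  }));
}
```

Running collectFingerprint once with headless: true and once with headless: false, then passing each result to looksHeadless, shows which of these two signals differ between the modes.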

Some possible workarounds:

Use puppeteer-extra

Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple of plugins that might help in getting past headless-mode detection:

  1. puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
  2. puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
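As a rough sketch of the second plugin in use, the snippet below wraps Puppeteer with puppeteer-extra and registers the stealth plugin before launching headlessly. It assumes the packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed; the helper names loadStealthPuppeteer and scrapeTitle are hypothetical. Whether this gets past a particular site's detection is not guaranteed.

```javascript
// Assumes: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
let stealthPuppeteer; // cached so the plugin is registered only once

// Hypothetical helper: loads puppeteer-extra lazily and adds the stealth plugin.
function loadStealthPuppeteer() {
  if (!stealthPuppeteer) {
    const puppeteer = require("puppeteer-extra");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    puppeteer.use(StealthPlugin()); // register the evasions before launch
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Hypothetical helper: opens `url` headlessly and returns the page title.
// `puppeteer` can be injected, which keeps the function easy to test.
async function scrapeTitle(url, puppeteer = loadStealthPuppeteer()) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Usage (inside an async function):
//   const title = await scrapeTitle("https://example.com");
```

In the Express handler from the question, scrapeTitle(req.params[0]) could stand in for the launch, goto and $eval steps.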

Run a "real" Chromium instance/UI

It's possible to run a single browser UI in a manner that lets you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0

Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions

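When you launch the browser from the command line with the --remote-debugging-port=9222 option, the terminal prints a line such as "DevTools listening on ws://127.0.0.1:9222/devtools/browser/<id>". That ws:// address is a WebSocket endpoint, which puppeteer.connect accepts as browserWSEndpoint; the browserURL option instead takes the HTTP address, for example http://127.0.0.1:9222.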
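The attach step can be sketched as follows. This assumes a Chrome or Chromium instance is already running on the same machine with --remote-debugging-port=9222, and that the puppeteer package is installed; the constant DEBUG_URL and the helper name scrapeTitleViaRunningBrowser are hypothetical. It uses browser.disconnect() rather than browser.close() so the shared browser keeps running between requests.

```javascript
// Assumes Chrome/Chromium was started separately, for example:
//   google-chrome --remote-debugging-port=9222
// (the binary name and any extra switches depend on your environment).
const DEBUG_URL = "http://127.0.0.1:9222"; // assumed host and port

// Hypothetical helper: attaches to the running browser, opens a tab,
// reads the title, then detaches without shutting the browser down.
async function scrapeTitleViaRunningBrowser(url, puppeteer = require("puppeteer")) {
  const browser = await puppeteer.connect({ browserURL: DEBUG_URL });
  let page;
  try {
    page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    if (page) await page.close(); // close only the tab we opened
    await browser.disconnect();   // detach; the browser itself keeps running
  }
}

// Usage (inside an async function):
//   const title = await scrapeTitleViaRunningBrowser("https://example.com");
```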

This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)

There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!
