Puppeteer not behaving like in Developer Console


Question

I am trying to use Puppeteer to extract the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106

I have the following code:

          (async () => {
            const browser = await puppet.launch({ headless: true });
            const page = await browser.newPage();
            await page.goto(req.params[0]); //this is the url
            title = await page.evaluate(() => {
              Array.from(document.querySelectorAll("meta")).filter(function (
                el
              ) {
                return (
                  (el.attributes.name !== null &&
                    el.attributes.name !== undefined &&
                    el.attributes.name.value.endsWith("title")) ||
                  (el.attributes.property !== null &&
                    el.attributes.property !== undefined &&
                    el.attributes.property.value.endsWith("title"))
                );
              })[0].attributes.content.value ||
                document.querySelector("title").innerText;
            });

I have tested this in the browser console, and even with Puppeteer's { headless: false } option. It works as expected in the browser, but when I actually run it with node it gives me the following error:

10:54:21 AM web.1 |  (node:10288) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'attributes' of undefined
10:54:21 AM web.1 |      at __puppeteer_evaluation_script__:14:20

So, when I run the same Array.from ...querySelectorAll("meta")... query in the browser, I get the expected string:

"Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom"

I'm starting to think I'm doing something wrong with the async promises, as that is the part that differs. Can anyone point me in the right direction?

As suggested, I tested with document.title, which should be there, but it also came back empty. See the code and log below:

          console.log(
            "testing the return",
            (async () => {
              const browser = await puppet.launch({ headless: true });
              const page = await browser.newPage();
              await page.goto(req.params[0]); //this is the url
              try {
                title = await page.evaluate(() => {
                  const title = document.title;
                  const isTitleThere = title == null ? false : true;
                  //recently read that this checks for undefined as well as null but not an
                  //undeclared var
                  return {
                    title: title,
                    titleTitle: title.title,
                    isTitleThere: isTitleThere,
                  };
                });
              } catch (error) {
                console.log(error, "There was an error");
              }

11:54:11 AM web.1 |  testing the return Promise { <pending> }
11:54:13 AM web.1 |  { title: '', isTitleThere: true }

Does this have to do with the page being a single-page application? I thought Puppeteer handled that because it loads everything first.

I have added the networkidle option and an 8000 millisecond wait, as suggested. The title is still empty. Code and log below:

            await page.goto(req.params[0], { waitUntil: "networkidle2" });
            await page.waitFor(8000);
            console.log("done waiting");
            title = await page.$eval("title", (el) => el.innerText);
            console.log("title: ", title);
            console.log("done retrieving");

12:36:39 PM web.1 |  done waiting
12:36:39 PM web.1 |  title:  
12:36:39 PM web.1 |  done retrieving

PROGRESS!! Thank you to theDavidBarton. It seems headless has to be false for it to work? Does anyone know why?

Recommended answer

If you only need the innerText of title, you can get the same result with Puppeteer's page.$eval method:

const title = await page.$eval('title', el => el.innerText)
console.log(title)

Output:

Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom

page.$eval(selector, pageFunction[, ...args])

The page.$eval method runs document.querySelector(selector) within the page and passes the matched element as the first argument to pageFunction. (Its sibling page.$$eval runs Array.from(document.querySelectorAll(selector)) and passes the resulting array instead.)

However, your main problem is that the page you are visiting is a single-page application (SPA) built with React, and its title is filled in dynamically by the JavaScript bundle. So at the moment Puppeteer queries the page, it finds a valid title element in the <head>, but its content is still "" (an empty string).
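The same timing issue would explain the original TypeError: if no matching meta tag exists yet, the filter returns an empty array and [0] is undefined. The callback in the question also lacks a return statement. A minimal sketch of a guarded extractor follows; the name extractTitle and the doc parameter are hypothetical, introduced only so the function can be exercised outside a browser. It does not solve the timing problem by itself, it only returns null instead of throwing.

```javascript
// Intended to run inside the page via page.evaluate(extractTitle).
// `doc` defaults to the page's global `document`; passing it explicitly is
// only a convenience for trying the function outside a browser.
function extractTitle(doc = document) {
  const meta = Array.from(doc.querySelectorAll('meta')).find((el) => {
    const name = el.getAttribute('name') || '';
    const property = el.getAttribute('property') || '';
    return name.endsWith('title') || property.endsWith('title');
  });
  const fromMeta = meta ? meta.getAttribute('content') : null;
  // Fall back to <title>; return null rather than throwing when both are empty.
  return fromMeta || doc.title || null;
}

// Hypothetical usage, with `page` being a Puppeteer Page:
//   const title = await page.evaluate(extractTitle);
```

A null result then signals "not populated yet", which the caller can handle by waiting and retrying.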

Normally, for SPAs you should use waitUntil: 'networkidle0' to make sure the DOM has been properly populated by the JS framework and the page is fully functional:

await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle0'
  })

Unfortunately, with this specific website it throws a timeout error, because the network connections don't close before the default 30000 ms timeout. Something seems to be wrong on the webpage's frontend side (web worker handling?).
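If you still want to try 'networkidle0' first, one option is to treat the timeout as non-fatal and carry on with whatever has loaded. The sketch below assumes that Puppeteer's navigation timeout surfaces as an error whose name is 'TimeoutError'; the helper name gotoTolerant is hypothetical.

```javascript
// Navigate and wait for 'networkidle0', but treat a navigation timeout as
// "good enough" rather than fatal. Resolves to true if the network went idle,
// false if the wait timed out. Any other error is rethrown.
async function gotoTolerant(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
    return true;
  } catch (err) {
    // Assumption: Puppeteer's timeout errors carry the name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return false;
    throw err;
  }
}

// Hypothetical usage, with `page` being a Puppeteer Page:
//   const idle = await gotoTolerant(page, url, 10000);
```

A false return value only tells you the network never went idle; you still need some other signal that the title has been populated.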

As a workaround, you can force Puppeteer to sleep for 8 seconds with await page.waitFor(8000) before you try to retrieve the title; by that time it will be properly populated. This is also why your script works in the DevTools Console: you are not running it immediately, so by the time you do, the page is fully loaded and the DOM is populated.
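A fixed sleep is either too long or too short. As an alternative sketch (not part of the original answer), you could wait for the condition itself with page.waitForFunction, which polls a predicate inside the page until it returns a truthy value. The helper name readTitleWhenReady and the 15 second default are assumptions chosen for illustration.

```javascript
// Wait until the SPA has filled in a non-empty document.title, then read it.
// Rejects if the title is still empty after `timeoutMs` milliseconds.
async function readTitleWhenReady(page, timeoutMs = 15000) {
  // The predicate runs inside the page, so `document` is the page's document.
  await page.waitForFunction(() => document.title.trim().length > 0, {
    timeout: timeoutMs,
  });
  return page.title();
}

// Hypothetical usage, with `page` being a Puppeteer Page:
//   await page.goto(url, { waitUntil: 'networkidle2' });
//   console.log(await readTitleWhenReady(page));
```

This returns as soon as the title appears instead of always paying the full 8 seconds.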

This script will return the expected title:

const puppeteer = require('puppeteer')

async function fn() {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle2'
  })
  // Give the SPA time to populate the <title> element
  await page.waitFor(8000)

  const title = await page.$eval('title', el => el.innerText)
  console.log(title)

  await browser.close()
}
// Log failures instead of leaving an unhandled promise rejection
fn().catch(console.error)

Launching with const browser = await puppeteer.launch({ headless: false }) may affect the result as well.
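Why headless mode would matter is not established in this thread. One possible explanation, offered here only as a guess, is that headless Chrome has historically identified itself with a HeadlessChrome token in its user agent, and some sites respond differently to it. The sketch below is a diagnostic for checking that in both modes; the helper names looksHeadless and reportUserAgent are hypothetical.

```javascript
// True if the user agent string contains the "HeadlessChrome" token.
function looksHeadless(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Read the user agent the page actually sees and classify it.
async function reportUserAgent(page) {
  const userAgent = await page.evaluate(() => navigator.userAgent);
  return { userAgent, looksHeadless: looksHeadless(userAgent) };
}

// Hypothetical usage, with `page` being a Puppeteer Page:
//   console.log(await reportUserAgent(page));
```

If the token shows up only in headless mode, overriding the user agent with page.setUserAgent before navigating would be one way to test whether it explains the difference.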
