Trouble modifying an object within page.evaluate in Puppeteer

Question

I am creating a Twitter scraper as a project. Tweets are rendered in the DOM as you scroll down, so I want to use Puppeteer to scroll, extract the data, save it into a predefined object, and then continue scrolling. The problem is that the script does not actually modify the object I pass in, and I am left with an empty object.

The for loop that extracts the data works when called outside the scrolling function (i.e. I can extract the first tweets rendered on the page). The scrolling function itself works; I took it from "Puppeteer - scroll down until you can't anymore".

For testing purposes I set the scrolling function to scroll only 20 times (it is otherwise designed to scroll until it can't scroll anymore). Here is my code:

app.get('/scrape', async (req, res) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(true)
    await page.goto(`https://twitter.com/${req.query.url}`);
    await page.setJavaScriptEnabled(true)
    let obj = {}
    await autoScroll(page, obj)
    async function autoScroll(page, obj) {
        await page.evaluate(async (obj) => {
            await new Promise((resolve, reject) => {
                var totalHeight = 0;
                var distance = 400;
                var count = 0
                var timer = setInterval(() => {
                    var scrollHeight = document.body.scrollHeight;
                    window.scrollBy(0, distance);
                    totalHeight += distance;
                    for (let i = 0; i < 100; i++) {
                        let id, date, text
                        try {
                            id = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].getAttribute('data-tweet-id')
                            date = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[1].getAttribute('title')
                            text = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].childNodes[3].childNodes[3].childNodes[1].innerHTML
                            obj[id] = { date: date, text: text }
                            console.log(i)
                        } catch (err) { continue }
                    }
                    count++
                    //if(totalHeight >= scrollHeight){
                    if (count === 20) {
                        clearInterval(timer);
                        resolve();
                    }
                }, 400);
            });
        }, obj);
    }
    res.send(obj)
    await browser.close();
})

The request sends an empty object every time. I don't receive any error messages or console logs; if there are any, I can't see them, because they are produced in the context of the headless Chrome browser that Puppeteer launches.

Any help would be greatly appreciated!

Recommended answer

The arguments you pass to page.evaluate are JSON-serialized and transferred to the page context.

The properties you assign to obj inside your page.evaluate() function therefore exist only on the copy in the page context, not on the object in the script where you called page.evaluate.

You can work around this by returning the obj object from the function instead of passing it as a parameter:

let obj = await page.evaluate(async() => {
  return new Promise(resolve => {
      let obj = {};
      // ...
      // set something on obj
      obj['foo'] = 'bar';

      // resolve with the obj
      resolve(obj);
      // ...
  });
});
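
The difference between mutating the argument and returning the data can be reproduced without a browser. The sketch below is only a simulation: fakeEvaluate is a hypothetical stand-in for page.evaluate (it is not part of Puppeteer) that copies its argument and its result through a JSON round trip, which roughly mimics the behaviour described above.

```javascript
// Hypothetical stand-in for page.evaluate (NOT a Puppeteer API): the argument
// is copied through a JSON round trip before the function sees it, and the
// return value is copied back the same way.
async function fakeEvaluate(pageFunction, arg) {
  const copyForPage =
    arg === undefined ? undefined : JSON.parse(JSON.stringify(arg));
  const result = await pageFunction(copyForPage);
  return result === undefined ? undefined : JSON.parse(JSON.stringify(result));
}

async function demo() {
  const obj = {};

  // Mutating the argument only changes the copy, so obj stays empty.
  await fakeEvaluate((o) => {
    o.foo = 'bar';
  }, obj);

  // Returning the data is what gets it back to the caller.
  const returned = await fakeEvaluate(() => ({ foo: 'bar' }));

  return { obj, returned };
}

demo().then(({ obj, returned }) => {
  console.log(obj);      // {}
  console.log(returned); // { foo: 'bar' }
});
```

The caller's object is never touched in the first call, which matches the empty response seen in the question; only the returned value makes it back.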

Integrated into your code snippet:

app.get('/scrape', async (req, res) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(true)
    await page.goto(`https://twitter.com/${req.query.url}`);
    await page.setJavaScriptEnabled(true)
    let obj = await autoScroll(page);
    async function autoScroll(page) {
        return page.evaluate(async () => {
            let obj = {};
            return new Promise((resolve, reject) => {
                var totalHeight = 0;
                var distance = 400;
                var count = 0
                var timer = setInterval(() => {
                    var scrollHeight = document.body.scrollHeight;
                    window.scrollBy(0, distance);
                    totalHeight += distance;
                    for (let i = 0; i < 100; i++) {
                        let id, date, text
                        try {
                            id = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].getAttribute('data-tweet-id')
                            date = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[1].getAttribute('title')
                            text = document.body.childNodes[7].childNodes[3].childNodes[1].childNodes[5].childNodes[1].childNodes[1].childNodes[3].childNodes[1].childNodes[3].childNodes[7].childNodes[1].childNodes[3].childNodes[1].childNodes[i].childNodes[1].childNodes[3].childNodes[3].childNodes[1].innerHTML
                            obj[id] = { date: date, text: text }
                            console.log(i)
                        } catch (err) { continue }
                    }
                    count++
                    //if(totalHeight >= scrollHeight){
                    if (count === 20) {
                        clearInterval(timer);
                        resolve(obj);
                    }
                }, 400);
            });
        });
    }
    res.send(obj)
    await browser.close();
})

If you're using a transpiler like Babel, you might need to pass the function to page.evaluate as a string. A string is evaluated as an expression, so wrap the function in parentheses and invoke it, e.g.:

await page.evaluate(`(async () => {
  return Promise.resolve(42);
})()`);

(Puppeteer calls .toString() on your function to get its source, which might contain references to helpers injected by Babel that don't exist in the page context.)
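
To see why shipping only the source text can break, here is a small browser-free sketch. The names helper and pageFunction are made up for illustration, and new Function is used purely to approximate "a scope where the helper does not exist"; it is not how Puppeteer itself works internally.

```javascript
function demo() {
  // Stand-in for a helper that a transpiler might inject next to your code.
  const helper = (x) => x * 2;
  const pageFunction = () => helper(21);

  // Only the source text survives the trip; the closure does not.
  const source = pageFunction.toString();

  // new Function builds the function in the global scope, where `helper`
  // is not defined, loosely similar to the page context.
  const rebuilt = new Function(`return (${source});`)();

  try {
    rebuilt();
    return 'no error';
  } catch (err) {
    return err.name;
  }
}

console.log(demo()); // ReferenceError
```

Passing the code as a string sidesteps this, because the transpiler leaves the contents of a string literal alone.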


To debug your selectors, you can launch Puppeteer in non-headless mode. That way you get a real browser window in which you can open the dev console. For example:

const browser = await puppeteer.launch({headless: false});
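
If you would rather stay headless, another option is to forward the page's console output to Node.js. The sketch below assumes Puppeteer's 'console' and 'pageerror' page events; forwardPageLogs, its log parameter, and createFakePage are hypothetical names introduced here, and the fake page exists only so the example runs without a browser.

```javascript
// Forward messages logged inside the page to a Node.js logger.
// `page` is expected to be a Puppeteer Page; `log` is a hypothetical
// parameter added so the output can be redirected.
function forwardPageLogs(page, log = console.log) {
  page.on('console', (msg) => log(`[page] ${msg.text()}`));
  page.on('pageerror', (err) => log(`[page error] ${err.message}`));
}

// Tiny fake page (illustration only) so the sketch runs without a browser.
function createFakePage() {
  const handlers = {};
  return {
    on(event, handler) {
      (handlers[event] = handlers[event] || []).push(handler);
    },
    emit(event, payload) {
      (handlers[event] || []).forEach((handler) => handler(payload));
    },
  };
}

const lines = [];
const fakePage = createFakePage();
forwardPageLogs(fakePage, (line) => lines.push(line));

// Simulate what the browser would send for console.log(...) and a thrown error.
fakePage.emit('console', { text: () => 'tweet 3 extracted' });
fakePage.emit('pageerror', new Error('boom'));

console.log(lines);
```

With a real page, you would call forwardPageLogs(page) right after browser.newPage(), so the console.log(i) calls inside page.evaluate show up in your terminal.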
