Puppeteer: Grabbing entire html from page that uses lazy load

Problem description

I am trying to grab the entire HTML of a web page that uses lazy loading. What I have tried is scrolling all the way to the bottom and then using page.content(). I have also tried scrolling back to the top of the page after scrolling to the bottom, and then using page.content(). Both ways grab some rows of the table, but not all of them, which is my main goal. I believe the web page uses lazy loading from React.

const puppeteer = require('puppeteer');
const fs = require('fs').promises;

const url = 'https://www.torontopearson.com/en/departures';

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(300); // page.waitFor was later replaced by page.waitForTimeout

    // scroll to the bottom so lazy-loaded content is rendered
    await autoScroll(page);
    await page.waitFor(2500);

    // scroll back near the top of the page
    await page.evaluate(() => window.scrollTo(0, 50));

    const html = await page.content();

    await fs.writeFile('scrape.html', html);
    console.log('Successfully written to file.');

    await browser.close();
});

// scrolls to the bottom of the page; adapted from user visualxcode in
// https://github.com/GoogleChrome/puppeteer/issues/305
async function autoScroll(page) {
    await page.evaluate(async () => {
        await new Promise(resolve => {
            let totalHeight = 0;
            const distance = 300;
            const timer = setInterval(() => {
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                // stop once we have scrolled past the full page height
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

Recommended answer

The problem is that the linked page is using the library react-virtualized. This library only renders the visible part of the website, so you cannot get the whole table at once. Scrolling to the bottom of the table will only put the bottom part of the table into the DOM.
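One way to work around this, if you do want to scrape the DOM, is to harvest whatever rows the virtualized list currently has rendered on each scroll step and accumulate them until no new rows appear. The following is only a sketch: the rowSelector value is hypothetical, and you would have to inspect the page to find the real selector for the table rows.

// Sketch (not from the original answer): accumulate rows of a
// virtualized table while scrolling. rowSelector is a placeholder --
// inspect the page to find the real selector for the table rows.
async function scrapeVirtualizedRows(page, rowSelector) {
    const seen = new Set();
    let stagnantSteps = 0;

    // stop after several consecutive scroll steps add no new rows
    while (stagnantSteps < 5) {
        const visibleRows = await page.evaluate(
            sel => Array.from(document.querySelectorAll(sel), el => el.innerText.trim()),
            rowSelector
        );

        const before = seen.size;
        visibleRows.forEach(row => seen.add(row));
        stagnantSteps = seen.size === before ? stagnantSteps + 1 : 0;

        await page.evaluate(() => window.scrollBy(0, 300));
        await page.waitFor(200); // page.waitForTimeout in newer Puppeteer versions
    }
    return Array.from(seen);
}

// usage: const rows = await scrapeVirtualizedRows(page, '.flight-row');

Because react-virtualized recycles DOM nodes, deduplicating by row text keeps each row only once even if it is rendered multiple times during the scroll.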

To check where the page loads its content from, you should check the network tab of the DevTools. You will notice that the content of the page is loaded from a separate URL, which seems to provide a perfect representation of the DOM in JSON format. So, there is really no need to scrape that data from the page. You can just use that URL.
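A minimal sketch of that approach follows. The endpoint itself has to be looked up in the Network tab, so DATA_URL below is only a placeholder:

const https = require('https');

// Placeholder: replace with the real JSON endpoint observed in the
// Network tab of the DevTools while the departures page loads.
const DATA_URL = 'https://example.com/departures.json';

https.get(DATA_URL, res => {
    let body = '';
    res.on('data', chunk => { body += chunk; });
    res.on('end', () => {
        const departures = JSON.parse(body);
        console.log(departures); // each entry should map to one table row
    });
}).on('error', err => console.error(err));

Fetching the JSON directly is both faster and more robust than driving a headless browser, since it sidesteps the virtualized rendering entirely.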
