Puppeteer is unable to get the complete source code


Problem description

I'm creating a simple scraping application with Node.js and Puppeteer. Below is the code I'm using right now, including the URL of the page I'm trying to scrape.

const url = `https://www.betrebels.gr/el/sports?catids=122,40,87,28,45,2&champids=423,274616,1496978,1484069,1484383,465990,465991,91,71,287,488038,488076,488075,1483480,201,2,367,38,1481454,18,226,440,441,442,443,444,445,446,447,448,449,451,452,453,456,457,458,459,460,278261&datefilter=TodayTomorrow&page=prelive`
await page.goto(url, {waitUntil: 'networkidle2'});
let content: string = await page.content();
await page.screenshot({path: 'page.png',fullPage: true});
await fs.writeFile("temp.html", content);
//...Analyze the html and other stuff.

The screenshot I get shows what I expect.

The page content, on the other hand, is minimal and doesn't contain the data visible in the screenshot.

Am I doing something wrong? Am I not waiting properly for the JavaScript to finish?

Solution

The page uses frames. page.content() only returns the main document, without the content of any frames embedded in it. To get the content of a frame, first find the iframe element (e.g. via page.$), then get its frame via elementHandle.contentFrame(). You can then call frame.content() to get the HTML of that frame.

Simple Example

const frameElementHandle = await page.$('#selector iframe');
const frame = await frameElementHandle.contentFrame();
const frameContent = await frame.content();

Depending on the structure of the page, you may need to do this for multiple frames to get all the content, or even for a frame nested inside another frame (which seems to be the case for the given page).
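
For the nested case, the same two calls can be repeated once per nesting level. The following is a minimal sketch, not taken from the original answer: resolveNestedFrame is a hypothetical helper name, and the selectors in the comments are placeholders rather than selectors from the actual page. It uses only $ and contentFrame(), and it throws if a level cannot be resolved.

```javascript
// Hypothetical helper: walks a chain of iframe selectors, one per nesting level.
// `root` is a Puppeteer Page or Frame. `selectors` is an array such as
// ['#outer iframe', 'iframe.inner'] (placeholders, not taken from the real page).
async function resolveNestedFrame(root, selectors) {
  let current = root;
  for (const selector of selectors) {
    // Look up the iframe element inside the current page or frame.
    const handle = await current.$(selector);
    if (!handle) {
      throw new Error(`No element matches "${selector}"`);
    }
    // Switch from the element handle to the frame it hosts.
    const frame = await handle.contentFrame();
    if (!frame) {
      throw new Error(`Element "${selector}" has no content frame`);
    }
    current = frame;
  }
  // The innermost frame; call .content() on it to read its HTML.
  return current;
}
```

After page.goto() has finished, this could be used as const inner = await resolveNestedFrame(page, ['#outer iframe', 'iframe.inner']); followed by await inner.content(), with the selectors replaced by ones that match the target page.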

Example to read all frame contents

Below is an example that recursively reads the contents of all frames on the page.

const contents = [];

// Recursively collects the HTML of every iframe below the given page or frame.
async function extractFrameContents(pageOrFrame) {
  const frames = await pageOrFrame.$$('iframe');
  for (const frameElement of frames) {
    const frame = await frameElement.contentFrame();
    // contentFrame() can resolve to null; skip such elements
    if (!frame) continue;

    // do something with the content, for example:
    contents.push(await frame.content());

    // recursively repeat for frames nested inside this one
    await extractFrameContents(frame);
  }
}
await extractFrameContents(page);
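
As an alternative to walking the iframe elements by hand, Puppeteer also exposes page.frames(), which returns every frame currently attached to the page, nested ones included, with the main frame among them. The sketch below is an addition to the original answer: collectFrameContents is a hypothetical helper name, and the try/catch assumes that a frame may be detached or navigate while it is being read.

```javascript
// Sketch: collect the HTML of every frame attached to the page via page.frames().
// `page` is a Puppeteer Page; the result also contains the main frame.
async function collectFrameContents(page) {
  const results = [];
  for (const frame of page.frames()) {
    try {
      results.push({ url: frame.url(), html: await frame.content() });
    } catch (err) {
      // Assumption: reading can fail if the frame goes away mid-read.
      // Record the failure and continue with the remaining frames.
      results.push({ url: frame.url(), html: null, error: err.message });
    }
  }
  return results;
}
```

Each entry pairs the frame URL with its HTML, which makes it easier to tell which frame holds the data you are looking for.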
