Get complete web page source HTML with puppeteer, but some part is always missing

Question

I am trying to scrape a specific string from the web page below:

https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;

The information I want from this page's source is the numeric sequence in the strings below (I can find them by right-clicking and choosing "View Page source"):

name="nr_rooms_4377601_232287150_0_1_0"
name="nr_rooms_4377601_232287150_1_1_0"

I am using puppeteer, and my code is below:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

But I cannot find the strings I am looking for in either response.text() or page.content().

Am I using the wrong methods on page?

How can I dump the actual page source, exactly as it appears when I right-click and choose "View Page source"?

Recommended answer

If you look at where these strings appear, you will find them in <select> elements with a specific class (.hprt-nos-select):

<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>

Wait until this element has been loaded into the DOM; it will then be visible in the page source as well:

await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
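As a minimal sketch of how that wait could fit into a full script, the example below waits for the selects and then pulls the name attributes out of the serialized DOM. The function names extractRoomSelectNames and scrapeRoomSelectNames are hypothetical, and the 30-second timeout and the networkidle2 wait condition are assumptions rather than part of the original answer (which uses timeout: 0, i.e. no timeout).

```javascript
// Pure helper (hypothetical name): collect every name="nr_rooms_..." value
// found in an HTML string.
function extractRoomSelectNames(html) {
  const pattern = /name="(nr_rooms_[0-9_]+)"/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Sketch (hypothetical name): load the page, wait for the room selects to be
// present in the DOM, then extract their names from page.content().
async function scrapeRoomSelectNames(url) {
  // Required lazily so the helper above can be used without puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: networkidle2 gives client-side scripts time to render.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Assumption: fail after 30 s instead of waiting forever.
    await page.waitForSelector('.hprt-nos-select', { timeout: 30000 });
    return extractRoomSelectNames(await page.content());
  } finally {
    // Always close the browser, even if the selector never appears.
    await browser.close();
  }
}
```

Calling scrapeRoomSelectNames(url).then(console.log) would print the matched names, provided the selects are actually rendered for that URL; if they are not, waitForSelector rejects with a timeout error.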

However, the real problem is that the URL you are visiting carries extra URL parameters, ?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;, which are not taken into account when the page is loaded through puppeteer. If you take a full-page screenshot, you will see that the page still shows the default hotel search form, without the specific hotel offers you are expecting.
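The screenshot check can be sketched roughly as follows. The names containsRoomSelects and diagnosePage are hypothetical, as are the screenshot path argument and the networkidle2 wait condition; the sketch only reports whether the room selects are present and does not try to explain why they might be missing.

```javascript
// Pure helper (hypothetical name): true if the HTML contains an element whose
// class attribute mentions hprt-nos-select.
function containsRoomSelects(html) {
  return /class="[^"]*\bhprt-nos-select\b[^"]*"/.test(html);
}

// Sketch (hypothetical name): save a full-page screenshot for visual
// inspection and report whether the room selects were rendered.
async function diagnosePage(url, screenshotPath) {
  // Required lazily so the helper above can be used without puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: networkidle2 gives client-side scripts time to render.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // fullPage: true captures the whole scrollable page, not just the viewport.
    await page.screenshot({ path: screenshotPath, fullPage: true });
    return containsRoomSelects(await page.content());
  } finally {
    await browser.close();
  }
}
```

If diagnosePage resolves to false and the saved image shows only the search form, that points to the situation described above rather than to a timing problem.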

You should interact with the search form through puppeteer (page.click() and so on) to set the dates and the origin country yourself, so that the page renders the content you expect.
