Scrape <p> tags between <h2> tags with Puppeteer

Views: 34 | Published: 2021/6/23 19:05:55 | Tags: javascript node.js web-scraping puppeteer


Problem description

I am new to Puppeteer and am learning to scrape a web page. The web page is structured in this way:

What I am trying to do is scrape all `<p>` tags between `<h2>Status</h2>` and `<h2>Naam</h2>`. With my current code I can scrape every `<p>` tag on the page. What I have not managed yet is to limit the result to the `<p>` tags that come after `<h2>Status</h2>` and before `<h2>Naam</h2>`.

My current code:

const puppeteer = require('puppeteer');

const plaatsengids = async (place) => {
    //Creates a Headless Browser Instance in the Background
    const browser = await puppeteer.launch();

    //Creates a Page Instance, similar to creating a new Tab
    const page = await browser.newPage();

    //Navigate the page to url
    await page.goto('https://plaatsengids.nl/'+place);

  /*  page.waitForSelector('.title').then(async function(){
        const title = await page.$eval('.title', element => element.innerHTML);
    })*/

    //Reads the page title and the first paragraph inside .body
    //(neither value is used in the result returned below)
    const title = await page.$eval('.title', element => element.innerHTML);
    const description = await page.$eval('.body p', element => element.innerHTML);

    //Collects the text of every <p> inside .body, dropping the first "- " in each
    const content = await page.evaluate(() => {
        const paragraphs = [...document.querySelectorAll('.body p')];
        return paragraphs.map((p) => p.textContent.replace("- ", ""));
    });



    //Closes the Browser Instance
    await browser.close();
    return content;
};




module.exports = plaatsengids;

The web page in question is: https://www.plaatsengids.nl/Stein

Recommended answer

You can use Node.compareDocumentPosition(), which returns a bitmask describing where another node sits relative to the node it is called on in document order:

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://www.plaatsengids.nl/Stein');

    const paragraphs = await page.evaluate(() => {
      const status = document.querySelector('h2[name="status"]');
      const naam = document.querySelector('h2[name="naam"]');

      // Keep each <p> that has the Status heading before it
      // and the Naam heading after it in document order.
      return [...document.querySelectorAll('p')]
        .filter(p => (p.compareDocumentPosition(status) & Node.DOCUMENT_POSITION_PRECEDING) &&
                     (p.compareDocumentPosition(naam) & Node.DOCUMENT_POSITION_FOLLOWING))
        .map(p => p.innerText);
    });

    console.log(paragraphs);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
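
To make the bitmask test in the filter easier to follow, here is a minimal sketch that runs in plain Node.js without a browser. It is a toy model, not the DOM API: `comparePosition`, `paragraphsBetween`, and the sample `elements` array are hypothetical names and data invented for illustration, and elements are modelled as entries in a flat, document-ordered array. Only the two bit values (2 for "preceding", 4 for "following") are taken from the DOM standard.

```javascript
// Bit values defined by the DOM standard for Node.compareDocumentPosition().
const DOCUMENT_POSITION_PRECEDING = 2; // the other node comes before this one
const DOCUMENT_POSITION_FOLLOWING = 4; // the other node comes after this one

// Toy stand-in for node.compareDocumentPosition(other). Nodes are modelled
// as indexes into a flat, document-ordered array (an assumption; real DOM
// nodes can also contain one another, which sets additional bits).
function comparePosition(selfIndex, otherIndex) {
  if (otherIndex < selfIndex) return DOCUMENT_POSITION_PRECEDING;
  if (otherIndex > selfIndex) return DOCUMENT_POSITION_FOLLOWING;
  return 0;
}

// Same filter shape as the answer: keep each paragraph that has the start
// heading before it and the end heading after it.
function paragraphsBetween(elements, startIndex, endIndex) {
  return elements
    .map((el, index) => ({ ...el, index }))
    .filter(el => el.tag === 'p')
    .filter(el =>
      (comparePosition(el.index, startIndex) & DOCUMENT_POSITION_PRECEDING) &&
      (comparePosition(el.index, endIndex) & DOCUMENT_POSITION_FOLLOWING))
    .map(el => el.text);
}

// Hypothetical sample data standing in for the page's elements.
const elements = [
  { tag: 'p', text: 'intro' },
  { tag: 'h2', text: 'Status' },
  { tag: 'p', text: 'first' },
  { tag: 'p', text: 'second' },
  { tag: 'h2', text: 'Naam' },
  { tag: 'p', text: 'after' },
];

console.log(paragraphsBetween(elements, 1, 4)); // → [ 'first', 'second' ]
```

The paragraph before the Status heading fails the first test, because the heading follows it rather than precedes it. The paragraph after the Naam heading fails the second test for the mirrored reason. In the real page, the same logic only works if both `querySelector` calls actually find a heading; if either returns `null`, `compareDocumentPosition` throws a `TypeError`.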
