Scrape <p> tags between <h2> tags with Puppeteer
Problem description
I am new to puppeteer and learning to scrape a web page. The web page is structured in this way:
What I'm trying to do is scrape all <p> tags between the <h2>Status</h2> and the <h2>Naam</h2>. With my current code, I can scrape all <p> tags on this page. Now I want to scrape only the <p> tags after the <h2>Status</h2> up to the <h2>Naam</h2>.
My current code:
const puppeteer = require('puppeteer');

const plaatsengids = async (place) => {
  // Launch a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Open a new page, similar to opening a new tab
  const page = await browser.newPage();
  // Navigate to the page for the requested place
  await page.goto('https://plaatsengids.nl/' + place);

  /* page.waitForSelector('.title').then(async function(){
    const title = await page.$eval('.title', element => element.innerHTML);
  })*/

  // Read the inner HTML of the first element with the class 'title'
  const Title = await page.$eval('.title', element => element.innerHTML);
  // Read the inner HTML of the first <p> inside '.body'
  const description = await page.$eval('.body p', element => element.innerHTML);

  // Collect the text of every <p> inside '.body', stripping the first "- " from each
  let content = await page.evaluate(() => {
    let divs = [...document.querySelectorAll('.body p')];
    return divs.map((div) => div.textContent.replace("- ", ""));
  });

  // Close the browser instance
  await browser.close();
  return content;
};

module.exports = plaatsengids;
The web page in question is: https://www.plaatsengids.nl/Stein
Recommended answer
You can use Node.compareDocumentPosition():
const puppeteer = require('puppeteer');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://www.plaatsengids.nl/Stein');

    const paragraphs = await page.evaluate(() => {
      // Locate the two boundary headings by their name attributes
      const status = document.querySelector('h2[name="status"]');
      const naam = document.querySelector('h2[name="naam"]');

      // Keep only the <p> elements that come after 'status' and before 'naam'
      return [...document.querySelectorAll('p')]
        .filter(p =>
          (p.compareDocumentPosition(status) & Node.DOCUMENT_POSITION_PRECEDING) &&
          (p.compareDocumentPosition(naam) & Node.DOCUMENT_POSITION_FOLLOWING))
        .map(p => p.innerText);
    });

    console.log(paragraphs);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even when navigation or evaluation fails
    if (browser) await browser.close();
  }
})();
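
To see why the filter in the answer works, it may help to look at the bitmask check on its own. The sketch below is not part of the original answer: `isBetween` and `fakeNode` are hypothetical names introduced for illustration, and the fake nodes only imitate the two position bits that matter here (real DOM nodes can set additional bits). The numeric values 2 and 4 are the ones the DOM standard assigns to Node.DOCUMENT_POSITION_PRECEDING and Node.DOCUMENT_POSITION_FOLLOWING.

```javascript
// Bit values defined by the DOM standard for compareDocumentPosition().
const DOCUMENT_POSITION_PRECEDING = 2; // the other node comes before the reference node
const DOCUMENT_POSITION_FOLLOWING = 4; // the other node comes after the reference node

// Hypothetical helper: true when `node` lies after `start` and before `end`
// in document order.
function isBetween(node, start, end) {
  const startPrecedes =
    (node.compareDocumentPosition(start) & DOCUMENT_POSITION_PRECEDING) !== 0;
  const endFollows =
    (node.compareDocumentPosition(end) & DOCUMENT_POSITION_FOLLOWING) !== 0;
  return startPrecedes && endFollows;
}

// Stand-in for DOM nodes (an assumption for this sketch) so the logic can be
// checked without a browser: each fake node only knows its index in document order.
function fakeNode(index) {
  return {
    index,
    compareDocumentPosition(other) {
      if (other.index < index) return DOCUMENT_POSITION_PRECEDING;
      if (other.index > index) return DOCUMENT_POSITION_FOLLOWING;
      return 0;
    },
  };
}

// Document order: intro, status, p1, p2, naam, p3
const [intro, status, p1, p2, naam, p3] = [0, 1, 2, 3, 4, 5].map(fakeNode);
const between = [intro, p1, p2, p3].filter(n => isBetween(n, status, naam));
console.log(between.map(n => n.index)); // [ 2, 3 ]
```

Inside page.evaluate(), the same check runs against real elements, with the two h2 headings as `start` and `end`. If either heading is not found, querySelector returns null and compareDocumentPosition throws, so a null check before filtering may be worth adding.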