Get all visible plain text and find out which HTML tag or DOM element each piece of text belongs to

Views: 45  Published: 2021/6/23 19:03:18  Tag: puppeteer


Question

I know how to get all visible plain text on a page:

const text = await page.$eval('*', el => el.innerText);

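But I also need to know which element of the page each piece of text belongs to, and I can't find a way to do that.

Solution

On the client side, you can use a TreeWalker. Here is an example that uses sample content from Web Scraper Testing Ground:

const IGNORE = ["style", "script"];
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
const pairs = [];
let node;

while ((node = walker.nextNode()) !== null) {
  // tagName is uppercase in HTML documents, so normalize it before comparing
  const parent = node.parentNode.tagName.toLowerCase();

  if (IGNORE.includes(parent)) {
    continue;
  }

  // Skip whitespace-only text nodes
  const value = node.nodeValue.trim();

  if (value.length === 0) {
    continue;
  }

  pairs.push([parent, value]);
}

console.log(pairs);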
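The pairs above record only the tag name, which does not tell one `div` from another. As a minimal sketch, assuming a label built from the tag name, `id` and class names is specific enough for your purposes, a helper like the hypothetical `describeElement` below could replace the bare tag name. It is not part of the original answer, and it reads only `tagName`, `id`, `className` and `parentElement`, so it can also be tried on plain objects outside a browser.

```javascript
// Hypothetical helper (not from the original answer): build a short,
// selector-like label for an element from its tag name, id and classes.
function describeElement(element) {
  const parts = [];

  // Walk up through parentElement until it is null or undefined.
  for (let el = element; el; el = el.parentElement) {
    let label = el.tagName.toLowerCase();

    if (el.id) {
      label += "#" + el.id;
    } else if (typeof el.className === "string" && el.className.trim() !== "") {
      label += "." + el.className.trim().split(/\s+/).join(".");
    }

    // Ancestors go in front so the label reads from the root downwards.
    parts.unshift(label);
  }

  return parts.join(" > ");
}
```

Inside the loop, `pairs.push([describeElement(node.parentNode), value])` would then store this label instead of the tag name.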


Following Grant Miller's answer, call it in Puppeteer using evaluate:

const pairs = await page.evaluate(() => {
  const IGNORE = ["style", "script"];
  const NONWHITESPACE_RE = /\S/;

  // Match every element that has at least one direct text node child
  const result = document.evaluate(
    "//*[child::text()]",
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const pairs = [];

  for (let i = 0, j = result.snapshotLength; i < j; i++) {
    const element = result.snapshotItem(i);

    if (IGNORE.includes(element.tagName.toLowerCase())) {
      continue;
    }

    const nodes = [...element.childNodes];

    for (const node of nodes) {
      if (node.nodeType !== document.TEXT_NODE) {
        continue;
      }

      // Filter out whitespace-only nodes
      if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
        continue;
      }

      pairs.push({
        tag: element.tagName.toLowerCase(),
        text: node.nodeValue.trim()
      });
    }
  }

  return pairs;
});

console.log(pairs);
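`page.evaluate` hands the array back to Node as plain serializable objects, so it can be post-processed there. One possible follow-up, sketched here with a hypothetical `groupByTag` helper that is not part of the original answer, collects the texts under each tag name:

```javascript
// Hypothetical helper: group { tag, text } records by tag name,
// keeping the texts in the order they were collected.
function groupByTag(pairs) {
  const byTag = new Map();

  for (const { tag, text } of pairs) {
    if (!byTag.has(tag)) {
      byTag.set(tag, []);
    }
    byTag.get(tag).push(text);
  }

  return byTag;
}
```

For example, `groupByTag(pairs).get("li")` would return every text found directly inside an `li` element.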

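Here is the original version of the client-side function. It uses XPath, but it always places a node's direct text children before its indirect ones:

const IGNORE = ["style", "script"];
const NONWHITESPACE_RE = /\S/;

// Get all text nodes in the document
const result = document.evaluate(
  // Match any node in the document that has at least one direct
  // text node child, including whitespace-only nodes
  "//*[child::text()]",
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

// The result does not use the JavaScript iterator protocol, so we have
// to iterate over the elements manually
const pairs = [];

for (let i = 0, j = result.snapshotLength; i < j; i++) {
  const element = result.snapshotItem(i);

  if (IGNORE.includes(element.tagName.toLowerCase())) {
    continue;
  }

  const nodes = [...element.childNodes];

  for (const node of nodes) {
    if (node.nodeType !== document.TEXT_NODE) {
      continue;
    }

    // Filter out whitespace-only nodes
    if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
      continue;
    }

    pairs.push({
      tag: element.tagName.toLowerCase(),
      // Remove `.trim()` to keep leading & trailing whitespace
      text: node.nodeValue.trim()
    });
  }
}

console.log(pairs);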
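The ordering difference is easiest to see on a tiny tree. The sketch below is an illustration only, not the answer's code: it uses plain mock objects in place of real DOM nodes (the `elementNode` and `textNode` factories are made up for this example), leaves out the `IGNORE` filter, and compares a document-order walk with an element-by-element walk that mirrors how the XPath version groups its output.

```javascript
// Node type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Document order: emit each text node where it appears in the tree.
function documentOrder(element, out = []) {
  for (const child of element.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      const value = child.nodeValue.trim();
      if (value !== "") {
        out.push({ tag: element.tagName.toLowerCase(), text: value });
      }
    } else if (child.nodeType === ELEMENT_NODE) {
      documentOrder(child, out);
    }
  }
  return out;
}

// Element-by-element order: emit all direct text children of an element
// first, then descend into its child elements.
function directChildrenFirst(element, out = []) {
  const tag = element.tagName.toLowerCase();
  for (const child of element.childNodes) {
    if (child.nodeType === TEXT_NODE && child.nodeValue.trim() !== "") {
      out.push({ tag, text: child.nodeValue.trim() });
    }
  }
  for (const child of element.childNodes) {
    if (child.nodeType === ELEMENT_NODE) {
      directChildrenFirst(child, out);
    }
  }
  return out;
}

// Mock of: <span><div>Dell Latitude</div>2 GHz Intel Pentium M</span>
const textNode = (nodeValue) => ({ nodeType: TEXT_NODE, nodeValue });
const elementNode = (tagName, ...childNodes) => ({
  nodeType: ELEMENT_NODE,
  tagName,
  childNodes
});
const sample = elementNode(
  "SPAN",
  elementNode("DIV", textNode("Dell Latitude")),
  textNode("2 GHz Intel Pentium M")
);

console.log(documentOrder(sample).map((p) => p.tag));       // [ 'div', 'span' ]
console.log(directChildrenFirst(sample).map((p) => p.tag)); // [ 'span', 'div' ]
```

If the output has to follow reading order, the document-order walk (which is also what a TreeWalker gives you) is the one to use.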


const IGNORE = ["style", "script"];
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
const pairs = [];
let node;

while ((node = walker.nextNode()) !== null) {
  // tagName is uppercase in HTML documents, so normalize it before comparing
  const parent = node.parentNode.tagName.toLowerCase();

  if (IGNORE.includes(parent)) {
    continue;
  }

  // Skip whitespace-only text nodes
  const value = node.nodeValue.trim();

  if (value.length === 0) {
    continue;
  }

  pairs.push([parent, value]);
}

console.log(pairs);

<div id="topbar"></div> <a href="/" style="text-decoration: none"> <div id="title">WEB SCRAPER TESTING GROUND</div> <div id="logo"></div> </a> <div id="content"> <h1>BLOCKS: Price List </h1> <div id="caseinfo">In this test, the web scraper needs to scrape a price list organized in a block layout. Specifically, it has to: <ol> <li>Extract all the products (their names, descriptions and prices), while skipping advertisements</li> <li>Scrape discounted products only</li> <li>Scrape products with red prices only</li> </ol> <p> </p><p>There is a <b>ver</b> parameter (which varies from 1 to 5) to show different table versions (with different product numbers, best price and advertisement positions).</p> <p>Also there are two tables presented: </p><ul> <li><b>Case 1</b> (simple one, with products and prices placed into the same block) </li><li><b>Case 2</b> (complicated one, with products and prices placed into separate blocks)</li> </ul> <p></p> <p>For testing, you may use the following sample links. 
The scraper should sufficiently scrape all data from a certain case using the same project: </p><ul> <li><a href="/blocks?ver=1">Price list 1</a></li> <li><a href="/blocks?ver=2">Price list 2</a></li> <li><a href="/blocks?ver=3">Price list 3</a></li> <li><a href="/blocks?ver=4">Price list 4</a></li> <li><a href="/blocks?ver=5">Price list 5</a></li> </ul> <p></p> </div> <div id="case_blocks"> <h2>Case 1</h2> <div id="case1"> <div class="prod2"><span style="float: left"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</span><span style="float: right">$239.95</span></div><div class="prod1"><span style="float: left"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</span><span style="float: right" class="best">$249.00</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</span><span style="float: right">$1,099.99</span></div><div class="prod1"><span style="float: left"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</span><span style="float: right" class="best">$385.72</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$549.99<div class="disc">discount 7%</div></span></div><div class="prod1"><span style="float: left"><div class="name">ASUS A53Z-AS61 
15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$399.99</span></div></div> <h2 style="margin-top: 50px">Case 2</h2> <div id="case2"> <div class="left"><div class="prod2"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</div><div class="prod1"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</div><div class="ads">ADVERTISEMENT</div><div class="prod2"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</div><div class="prod1"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$239.95</div><div class="price1 best">$249.00</div><div class="ads"></div><div class="price2">$1,099.99</div><div class="price1 best">$385.72</div></div><div class="ads" style="clear: both">ADVERTISEMENT</div><div class="left"><div class="prod2"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</div><div class="prod1"><div class="name">ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$549.99<div class="disc">discount 7%</div></div><div class="price1">$399.99</div></div></div> </div> <br><br><br> </div>
