Get all visible plain text and find out which HTML tag or DOM element each piece of text belongs to


Problem description

I know how to get all visible plain text on a page:

const text = await page.$eval('*', el => el.innerText);

But I also need to know which element of the page each piece of text belongs to, and I can't find a way to do that.

Solution

On the client side, you can do this in a way that preserves order using TreeWalker. Here’s an example with sample content from Web Scraper Testing Ground:

const IGNORE = ["style", "script"];

const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);

const pairs = [];

let node;

while ((node = walker.nextNode()) !== null) {
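  // tagName is uppercase for HTML elements, so normalize it before
  // comparing against the lowercase names in IGNORE
  const parent = node.parentNode.tagName.toLowerCase();

  if (IGNORE.includes(parent)) {
    continue;
  }

  const value = node.nodeValue.trim();

  if (value.length === 0) {
    continue;
  }

  pairs.push([parent, value]);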
}

console.log(pairs);

<div id="topbar"></div>
		<a href="/" style="text-decoration: none">
		    <div id="title">WEB SCRAPER TESTING GROUND</div>
		    <div id="logo"></div>
		</a>
		<div id="content">
<h1>BLOCKS: Price List </h1>
<div id="caseinfo">In this test, the web scraper needs to scrape a price list organized in a block layout. Specifically, it has to:
	<ol>
		<li>Extract all the products (their names, descriptions and prices), while skipping advertisements</li>
		<li>Scrape discounted products only</li>
		<li>Scrape products with red prices only</li>
	</ol>
<p>
</p><p>There is a <b>ver</b> parameter (which varies from 1 to 5) to show different table versions (with different product numbers, best price and advertisement positions).</p>
<p>Also there are two tables presented:
	</p><ul>
		<li><b>Case 1</b> (simple one, with products and prices placed into the same block)
		</li><li><b>Case 2</b> (complicated one, with products and prices placed into separate blocks)</li>
	</ul>
<p></p>
<p>For testing, you may use the following sample links. The scraper should sufficiently scrape all data from a certain case using the same project:
</p><ul>
	<li><a href="/blocks?ver=1">Price list 1</a></li>
	<li><a href="/blocks?ver=2">Price list 2</a></li>
	<li><a href="/blocks?ver=3">Price list 3</a></li>
	<li><a href="/blocks?ver=4">Price list 4</a></li>
	<li><a href="/blocks?ver=5">Price list 5</a></li>
</ul>
<p></p>
</div>

<div id="case_blocks">

<h2>Case 1</h2>
<div id="case1">
<div class="prod2"><span style="float: left"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</span><span style="float: right">$239.95</span></div><div class="prod1"><span style="float: left"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</span><span style="float: right" class="best">$249.00</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</span><span style="float: right">$1,099.99</span></div><div class="prod1"><span style="float: left"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</span><span style="float: right" class="best">$385.72</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$549.99<div class="disc">discount 7%</div></span></div><div class="prod1"><span style="float: left"><div class="name">ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$399.99</span></div></div>

<h2 style="margin-top: 50px">Case 2</h2>
<div id="case2">
<div class="left"><div class="prod2"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</div><div class="prod1"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</div><div class="ads">ADVERTISEMENT</div><div class="prod2"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</div><div class="prod1"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$239.95</div><div class="price1 best">$249.00</div><div class="ads"></div><div class="price2">$1,099.99</div><div class="price1 best">$385.72</div></div><div class="ads" style="clear: both">ADVERTISEMENT</div><div class="left"><div class="prod2"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</div><div class="prod1"><div class="name">ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$549.99<div class="disc">discount 7%</div></div><div class="price1">$399.99</div></div></div>

</div>
<br><br><br>
		</div>
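One caveat: a TreeWalker walks the DOM regardless of CSS, so it also returns text inside elements hidden with display: none, which innerText would leave out. If only rendered text matters, a visibility check can be added before pushing each pair. The following is a minimal sketch, not part of the original answer: isRendered and its getStyle parameter are hypothetical names, and the check covers only display and visibility, not opacity, clipping, or off-screen positioning.

```javascript
const ELEMENT_NODE = 1;

// Hypothetical helper: returns false when the element or any ancestor has
// display: none, or when the element's own computed visibility is hidden.
// `getStyle` is injected so the function can be exercised outside a browser;
// in a page you would pass (el) => window.getComputedStyle(el).
function isRendered(element, getStyle) {
  for (let el = element; el && el.nodeType === ELEMENT_NODE; el = el.parentNode) {
    const style = getStyle(el);

    // display is not inherited, so every ancestor has to be checked
    if (style.display === "none") {
      return false;
    }

    // computed visibility is inherited, so checking the element itself is enough
    if (el === element && (style.visibility === "hidden" || style.visibility === "collapse")) {
      return false;
    }
  }

  return true;
}
```

Inside the while loop above, this could be used as: if (!isRendered(node.parentNode, (el) => window.getComputedStyle(el))) { continue; }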

To run the same logic in Puppeteer, wrap it in page.evaluate, per Grant Miller’s answer. This version selects elements with XPath instead of TreeWalker:

const pairs = await page.evaluate(() => {
  const IGNORE = ["style", "script"];
  const NONWHITESPACE_RE = /\S/;

  const result = document.evaluate(
    "//*[child::text()]",
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );

  const pairs = [];

  for (let i = 0, j = result.snapshotLength; i < j; i++) {
    const element = result.snapshotItem(i);

    if (IGNORE.includes(element.tagName.toLowerCase())) {
      continue;
    }

    const nodes = [...element.childNodes];

    for (const node of nodes) {
      if (node.nodeType !== document.TEXT_NODE) {
        continue;
      }

      if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
        continue;
      }

      pairs.push({
        tag: element.tagName.toLowerCase(),
        text: node.nodeValue.trim()
      });
    }
  }

  return pairs;
});

console.log(pairs);
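The pairs above record only the tag name, which cannot tell two div elements apart. If a more specific answer to "which element" is needed, one option is to store a short path built from each element's tag, id, and classes. The helper below is a hedged sketch: elementPath is a hypothetical name, and the path it returns is a readable hint, not a guaranteed unique selector.

```javascript
const ELEMENT_NODE = 1;

// Hypothetical helper: builds a path such as "div#content > h1" by walking
// from the element up to (but not including) <body>.
function elementPath(element) {
  const parts = [];

  for (let el = element; el && el.nodeType === ELEMENT_NODE; el = el.parentNode) {
    const tag = el.tagName.toLowerCase();

    if (tag === "body" || tag === "html") {
      break;
    }

    let part = tag;

    if (el.id) {
      part += "#" + el.id;
    } else if (typeof el.className === "string" && el.className.trim() !== "") {
      // className is not a plain string on SVG elements, hence the typeof check
      part += "." + el.className.trim().split(/\s+/).join(".");
    }

    parts.unshift(part);
  }

  return parts.join(" > ");
}
```

Because the callback passed to page.evaluate runs in the browser, the helper has to be defined inside that callback. The pushed object could then become { tag: ..., path: elementPath(element), text: ... }.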


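Here is the original version of the client-side function. It uses XPath and groups text by element, so all of an element's direct text nodes are emitted before any text from its descendants. When text and child elements are interleaved, the output is therefore not in strict document order: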

const IGNORE = ["style", "script"];
const NONWHITESPACE_RE = /\S/;

// get all elements in the document that contain text directly
const result = document.evaluate(
  // matches any node in the document that has at least one direct
  // text node child, including whitespace-only nodes
  "//*[child::text()]",
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

// the result doesn't use the JavaScript iterator protocol, so we have
// to manually iterate over the elements
const pairs = [];

for (let i = 0, j = result.snapshotLength; i < j; i++) {
  const element = result.snapshotItem(i);

  if (IGNORE.includes(element.tagName.toLowerCase())) {
    continue;
  }

  const nodes = [...element.childNodes];

  for (const node of nodes) {
    if (node.nodeType !== document.TEXT_NODE) {
      continue;
    }

    // filter out whitespace-only nodes
    if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
      continue;
    }

    pairs.push({
      tag: element.tagName.toLowerCase(),
      // remove the `.trim()` to preserve leading & trailing whitespace
      text: node.nodeValue.trim()
    });
  }
}

console.log(pairs);

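The ordering difference between the two approaches is easiest to see on a paragraph with inline markup, such as the "There is a <b>ver</b> parameter" line in the sample content. The sketch below reproduces both orderings with plain recursion over childNodes. The el and text helpers build hypothetical stand-in objects so the comparison runs without a browser; the two walker functions only read nodeType, nodeValue, tagName, and childNodes, so they should behave the same on real DOM nodes.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Stand-ins for DOM nodes (assumption: only the properties used below matter)
const el = (tagName, childNodes) => ({ nodeType: ELEMENT_NODE, tagName, childNodes });
const text = (nodeValue) => ({ nodeType: TEXT_NODE, nodeValue });

// Document order, like the TreeWalker version: descend as children are met
function pairsInDocumentOrder(element, pairs = []) {
  for (const child of element.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      const value = child.nodeValue.trim();
      if (value.length > 0) {
        pairs.push([element.tagName.toLowerCase(), value]);
      }
    } else if (child.nodeType === ELEMENT_NODE) {
      pairsInDocumentOrder(child, pairs);
    }
  }
  return pairs;
}

// Grouped by element, like the XPath version: direct text first, then descend
function pairsGroupedByElement(element, pairs = []) {
  for (const child of element.childNodes) {
    if (child.nodeType === TEXT_NODE && child.nodeValue.trim().length > 0) {
      pairs.push([element.tagName.toLowerCase(), child.nodeValue.trim()]);
    }
  }
  for (const child of element.childNodes) {
    if (child.nodeType === ELEMENT_NODE) {
      pairsGroupedByElement(child, pairs);
    }
  }
  return pairs;
}

const sample = el("P", [text("There is a "), el("B", [text("ver")]), text(" parameter")]);

console.log(pairsInDocumentOrder(sample));
// [["p", "There is a"], ["b", "ver"], ["p", "parameter"]]
console.log(pairsGroupedByElement(sample));
// [["p", "There is a"], ["p", "parameter"], ["b", "ver"]]
```

If the text needs to be read back in the order it appears on the page, the document-order variant is the one to use.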
