Extract text based on previous and next sibling

Views: 28 Published: 2021/9/24 18:50:44 Tags: xpath web-scraping

Problem description

I'm trying to extract data from the following structure:

<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br />
data#1
<br />
<br />
<span>Heading4</span><br />
&acirc;&euro;&cent; data#4.1
<br />
&acirc;&euro;&cent; data#4.2
<br />
&acirc;&euro;&cent; data#4.3
<br />
&acirc;&euro;&cent; data#4.4
<br />
<br />
<span>Heading5</span>
<br />
&acirc;&euro;&cent; data#5.1
<br />
&acirc;&euro;&cent; data#5.2
<br />
&acirc;&euro;&cent; data#5.3
<br />
<br />

I can extract data#1 using something like this:

span[text()='Heading1']/following-sibling::br[1]/following::text()[1]

But I can't figure out how to extract the data under Heading4. I need to extract data#4.1, data#4.2, data#4.3 and data#4.4. The number of points is not fixed and can vary.

Recommended answer

This XPath 1.0 expression selects exactly the wanted nodes:

  /*/span[.='Heading4']
        /following-sibling::text()
           [count(.|/*/span[.='Heading5']/preceding-sibling::text())
           =
            count(/*/span[.='Heading5']/preceding-sibling::text())
            ]
                  [normalize-space()]

It is produced from the well-known Kayessian method for the intersection of two node-sets $ns1 and $ns2:

$ns1[count(.|$ns2) = count($ns2)]

We obtain the first expression above if, in the Kayessian formula, we substitute $ns1 with:

  /*/span[.='Heading4']/following-sibling::text()

and $ns2 with:

  /*/span[.='Heading5']/preceding-sibling::text()

The final predicate [normalize-space()] filters out the whitespace-only text nodes from this intersection.
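
To see why the count comparison acts as an intersection, here is a minimal Python sketch that models the formula with ordinary sets. The function name `kayessian_intersection` and the string stand-ins for text nodes are assumptions made for illustration; they are not part of the original answer and no XPath engine is involved.

```python
def kayessian_intersection(ns1, ns2):
    """Model $ns1[count(.|$ns2) = count($ns2)] using Python sets."""
    ns2_set = set(ns2)
    # A node is kept when its union with ns2 is no larger than ns2 itself,
    # which can only happen if the node already belongs to ns2.
    return [node for node in ns1 if len({node} | ns2_set) == len(ns2_set)]


# Hypothetical stand-ins for the non-blank text nodes of the sample document.
after_heading4 = ["data#4.1", "data#4.2", "data#4.3", "data#4.4",
                  "data#5.1", "data#5.2", "data#5.3"]
before_heading5 = ["data#1", "data#4.1", "data#4.2", "data#4.3", "data#4.4"]

result = kayessian_intersection(after_heading4, before_heading5)
print(result)  # ['data#4.1', 'data#4.2', 'data#4.3', 'data#4.4']
```

The list comprehension keeps the order of the first list, in the same way that the XPath predicate filters $ns1 without reordering it.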

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:template match="/">
     <xsl:copy-of select=
      "/*/span[.='Heading4']
            /following-sibling::text()
               [count(.|/*/span[.='Heading5']/preceding-sibling::text())
               =
                count(/*/span[.='Heading5']/preceding-sibling::text())
                ]
                [normalize-space()]
      "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document (with the entities replaced, since we have no DTD that defines them and they aren't essential here):

<html>
    <span>Heading</span>
    <br />
    <br />
    <span>Heading1</span>
    <br /> data#1 
    <br />
    <br />
    <span>Heading4</span>
    <br /> #acirc;#euro;#cent; data#4.1 
    <br /> #acirc;#euro;#cent; data#4.2 
    <br /> #acirc;#euro;#cent; data#4.3 
    <br /> #acirc;#euro;#cent; data#4.4 
    <br />
    <br />
    <span>Heading5</span>
    <br /> #acirc;#euro;#cent; data#5.1 
    <br /> #acirc;#euro;#cent; data#5.2 
    <br /> #acirc;#euro;#cent; data#5.3 
    <br />
    <br />
</html>

the XPath expression is evaluated and the result of this evaluation is copied to the output:

 #acirc;#euro;#cent; data#4.1 
     #acirc;#euro;#cent; data#4.2 
     #acirc;#euro;#cent; data#4.3 
     #acirc;#euro;#cent; data#4.4
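
For readers who scrape with Python rather than XSLT, the same selection can be approximated with the standard library alone. This is a sketch under stated assumptions, not the answer's method: `xml.etree.ElementTree` does not support the `following-sibling` axis, so the code walks the children of the root and reads each element's `tail` (the text that follows it) instead of evaluating the XPath expression. The helper name `texts_between` is hypothetical, and the sample document omits the bullet prefix for brevity.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the sample document (bullet prefix omitted).
DOC = """<html>
    <span>Heading</span>
    <br />
    <br />
    <span>Heading1</span>
    <br /> data#1
    <br />
    <br />
    <span>Heading4</span>
    <br /> data#4.1
    <br /> data#4.2
    <br /> data#4.3
    <br /> data#4.4
    <br />
    <br />
    <span>Heading5</span>
    <br /> data#5.1
    <br /> data#5.2
    <br /> data#5.3
    <br />
    <br />
</html>"""


def texts_between(root, start, end):
    """Return non-blank text found after the `start` span and before the `end` span."""
    collected = []
    inside = False
    for child in root:
        heading = (child.text or "").strip() if child.tag == "span" else None
        if inside and heading == end:
            return collected
        if heading == start:
            inside = True
        if inside:
            # In ElementTree, the text that follows an element is stored in its tail.
            tail = (child.tail or "").strip()
            if tail:
                collected.append(tail)
    # The end heading was never reached after the start heading, so nothing
    # qualifies, which mirrors the empty intersection in the XPath version.
    return []


root = ET.fromstring(DOC)
print(texts_between(root, "Heading4", "Heading5"))
# ['data#4.1', 'data#4.2', 'data#4.3', 'data#4.4']
```

Because the loop stops at the second heading, the number of data lines under a heading can vary without any change to the code.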
