Extract text based on previous and next sibling
Question
I'm trying to extract data from the following structure:
<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br />
data#1
<br />
<br />
<span>Heading4</span><br />
• data#4.1
<br />
• data#4.2
<br />
• data#4.3
<br />
• data#4.4
<br />
<br />
<span>Heading5</span>
<br />
• data#5.1
<br />
• data#5.2
<br />
• data#5.3
<br />
<br />
I can extract data#1 using something like this:
span[text()='Heading1']/following-sibling::br[1]/following::text()[1]
But I can't figure out how to extract the data under Heading4. I need to extract data#4.1, data#4.2, data#4.3 and data#4.4. The number of points is not fixed and can vary.
Recommended answer
This XPath 1.0 expression selects exactly the wanted nodes:
/*/span[.='Heading4']
/following-sibling::text()
[count(.|/*/span[.='Heading5']/preceding-sibling::text())
=
count(/*/span[.='Heading5']/preceding-sibling::text())
]
[normalize-space()]
It is produced from the well-known Kayessian method for the intersection of two nodesets $ns1 and $ns2:
$ns1[count(.|$ns2) = count($ns2)]
We obtain the first expression above if, in the Kayessian formula, we substitute $ns1 with:
/*/span[.='Heading4']/following-sibling::text()
and $ns2 with:
/*/span[.='Heading5']/preceding-sibling::text()
The final predicate [normalize-space()] filters out the whitespace-only text nodes from this intersection.
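For readers who want to see the same idea outside XPath, here is a rough Python sketch using only the standard library. It is an illustration, not part of the original answer: the helper names `normalize_space` and `texts_between` are made up for this example, the bullet entities are left out of the sample document, and because ElementTree has no sibling axes, each text node is identified by the index of the element it trails (its `.tail`). The membership test mirrors `$ns1[count(.|$ns2) = count($ns2)]`.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, with the bullet entities omitted (assumption).
DOC = """<html>
<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br /> data#1
<br />
<br />
<span>Heading4</span>
<br /> data#4.1
<br /> data#4.2
<br /> data#4.3
<br /> data#4.4
<br />
<br />
<span>Heading5</span>
<br /> data#5.1
<br /> data#5.2
<br /> data#5.3
<br />
<br />
</html>"""


def normalize_space(s):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(s.split())


def texts_between(root, start, end):
    """Non-blank sibling text nodes after span `start` and before span `end`."""
    children = list(root)

    def index_of(heading):
        return next(i for i, c in enumerate(children)
                    if c.tag == "span" and c.text == heading)

    # ElementTree stores the text node that follows children[i] in children[i].tail,
    # so each text node is represented here by the index i of that element.
    ns1 = set(range(index_of(start), len(children)))  # following-sibling::text() of start
    ns2 = set(range(index_of(end)))                   # preceding-sibling::text() of end
    # (root.text, the text before the first child, is left out of ns2;
    #  it can never follow `start`, so it cannot be in the intersection.)

    # Kayessian test: a node belongs to ns2 exactly when adding it does not grow ns2.
    hits = [i for i in sorted(ns1) if len({i} | ns2) == len(ns2)]

    texts = (normalize_space(children[i].tail or "") for i in hits)
    return [t for t in texts if t]  # same role as the [normalize-space()] predicate


root = ET.fromstring(DOC)
result = texts_between(root, "Heading4", "Heading5")
print(result)  # ['data#4.1', 'data#4.2', 'data#4.3', 'data#4.4']
```

In ordinary Python one would simply write `ns1 & ns2`; the count-based test is kept only to show why the XPath 1.0 formula works without a built-in intersection operator.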
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/span[.='Heading4']
/following-sibling::text()
[count(.|/*/span[.='Heading5']/preceding-sibling::text())
=
count(/*/span[.='Heading5']/preceding-sibling::text())
]
[normalize-space()]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document (with the entities replaced -- as we don't have a DTD defining them available and this isn't essential here):
<html>
<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br /> data#1
<br />
<br />
<span>Heading4</span>
<br /> #acirc;#euro;#cent; data#4.1
<br /> #acirc;#euro;#cent; data#4.2
<br /> #acirc;#euro;#cent; data#4.3
<br /> #acirc;#euro;#cent; data#4.4
<br />
<br />
<span>Heading5</span>
<br /> #acirc;#euro;#cent; data#5.1
<br /> #acirc;#euro;#cent; data#5.2
<br /> #acirc;#euro;#cent; data#5.3
<br />
<br />
</html>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
#acirc;#euro;#cent; data#4.1
#acirc;#euro;#cent; data#4.2
#acirc;#euro;#cent; data#4.3
#acirc;#euro;#cent; data#4.4