XPath - extracting text between two nodes

I'm having a problem with my XPath query. I have to parse a div that is divided into an unknown number of "sections". Each section is introduced by an h5 containing the section name. The list of possible section titles is known, and each title can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".

HTML

<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>

Expected result (for SecondHeader)

['text2a', 'text2b']

Query #1

//text()[following-sibling::h5/text()='ThirdHeader']

Result #1

['text1', 'text2a', 'text2b']

That's obviously a bit too much, so I decided to restrict the result to the content between the selected header and the header before it.

Query #2

//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']

Result #2

['text2a', 'text2b']

These results meet expectations. However, I can't use this query, because I don't know whether SecondHeader or ThirdHeader will exist in the parsed page. The query needs to use only one section title.

Query #3

//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]

Result #3

[]

Could you please tell me what I am doing wrong? I've tested it in Google Chrome.

Solution

You should be able to just test the first preceding sibling h5...

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
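
To try this outside the browser, here is a minimal sketch in Python. It assumes the third-party lxml library is installed; the helper name `section_text` and the `$header` XPath variable are illustrative choices, not part of the original answer. It runs the query above against the sample markup from the question and drops the whitespace-only text nodes that the indentation produces.

```python
from lxml import html  # third-party library (assumption): pip install lxml

# Sample markup copied from the question.
SAMPLE = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>
"""


def section_text(doc, header):
    """Return the non-empty text nodes whose nearest preceding h5 sibling is `header`.

    `section_text` is a hypothetical helper written for this example.
    """
    # h5[1] on the preceding-sibling axis is the closest h5 before the text node,
    # so only one header name is needed in the query.
    query = "//text()[preceding-sibling::h5[1][normalize-space()=$header]]"
    nodes = doc.xpath(query, header=header)  # lxml binds $header from the keyword
    return [t.strip() for t in nodes if t.strip()]


doc = html.fromstring(SAMPLE)
print(section_text(doc, "SecondHeader"))  # ['text2a', 'text2b']
```

Because only the nearest preceding h5 is tested, a header that is missing from the page simply yields an empty list. As for Query #3: `not` is an XPath function and needs parentheses, `not(...)`. Written as `not[...]`, it is parsed as a child element named `not` with a predicate, which matches nothing here, so the whole expression returns an empty result.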
