Get text followed by certain text or get all text if that text is missing
Question
I need to extract text from HTML pages, but some of them contain unwanted text that comes after a certain separator paragraph ('---------'). Example of HTML page 1:
...
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
...
Example of HTML page 2:
...
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
...
So if the page contains '---------', I need to grab only the text before it; otherwise, I need to grab everything. As noted here (Get text followed by certain text), I can use:
//p[following-sibling::p[contains(.,'---------')]][1]/text()
for the first example. But is there a way to use one XPath for both cases?
Recommended answer
//p[ not(contains(.,'---------'))
and not(preceding-sibling::p[contains(.,'---------')])]//text()
will return
This is correct text. Everything after it is wrong
for your first case and
This is correct text. Everything after it is wrong
This text is also valid
This is another correct text
for your second case, as requested.
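The expression works for both pages because it selects every p that is not itself the separator and has no separator among its preceding siblings; when no separator exists, both conditions hold for every paragraph. As a rough way to check this, the sketch below runs the XPath against both example pages in Python. It assumes the third-party lxml library is installed, and the wrapping div, the PAGE_1/PAGE_2 constants, and the extract_valid_text helper are hypothetical names introduced only for this illustration.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# The XPath from the answer: keep <p> elements that are not the separator
# and have no separator among their preceding siblings.
XPATH = (
    "//p[not(contains(., '---------'))"
    " and not(preceding-sibling::p[contains(., '---------')])]//text()"
)

# Example page 1 from the question, wrapped in a <div> (an assumption,
# since the question elides the surrounding markup).
PAGE_1 = """
<div>
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
</div>
"""

# Example page 2 from the question: no separator paragraph at all.
PAGE_2 = """
<div>
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
</div>
"""


def extract_valid_text(page_source):
    """Return the stripped, non-empty text nodes selected by XPATH."""
    tree = html.fromstring(page_source)
    return [text.strip() for text in tree.xpath(XPATH) if text.strip()]


page_1_texts = extract_valid_text(PAGE_1)
page_2_texts = extract_valid_text(PAGE_2)

print(page_1_texts)
print(page_2_texts)
```

Note that preceding-sibling only looks at paragraphs sharing the same parent, so this sketch assumes the separator and the text paragraphs are siblings, as they are in the question's examples.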