Extracting text in between nodes through XPath


Problem description


I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

However, I cannot hardcode the ending item name, because the order of the items differs across the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').

Any help on how to adapt my XPath query would be greatly appreciated.

Solution

For the sake of completeness, the final query, composed of various suggestions throughout the thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
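
For comparison, here is a minimal Python sketch of the same idea (collect everything after a header until the next header, whatever that next header is called). It is not the XPath query above: the standard library's `xml.etree.ElementTree` does not support the `preceding::` and `following::` axes, so this walks the children of the first `textfield` div instead. The helper name `text_by_header` is hypothetical, and the sketch assumes the markup parses as XML, which may not hold for the real page.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question.
PAGE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def text_by_header(container):
    """Map each header's text to the text between it and the next header.

    Hypothetical helper: no header name is hardcoded as the end marker.
    """
    sections = {}
    current = None  # header being collected; None until the first header is seen
    for child in container:
        if child.get("class") == "header":
            current = "".join(child.itertext()).strip()
            sections[current] = []
        elif current is not None:
            # itertext() yields the element's inner text, not its tail
            sections[current].extend(child.itertext())
        # Text that directly follows an element is stored in its tail.
        if current is not None and child.tail:
            sections[current].append(child.tail)
    # Collapse runs of whitespace left over from the indentation.
    return {name: " ".join("".join(parts).split())
            for name, parts in sections.items()}


root = ET.fromstring(PAGE)
first_textfield = root.find("./div[@class='textfield']")  # first match only
items = text_by_header(first_textfield)
print(items["First item"])
```

Because each section ends at whichever header comes next, the result does not depend on the order of the items. For markup that is not well-formed, an HTML-tolerant parser would be needed before either approach can run.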
