XPath to get markup between two headings


Question

I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I assumed I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered that it wouldn't be so easy. The best way to separate the content once I have the page is to select what lies between two sets of h2 tags.

Example: <h2>Title</h2> <div>Some Content</div> <h2>Title</h2>

Here I want to get the div between the two headers. I tried doing this with XPath, but with no luck at all. I am going to look more into XPath because I think that's what I need to achieve what I want, but before I look too deeply into it, I would like to hear what you think. Is XPath the right way to go, or do I have other, easier options? I am writing the application in C#, if that makes any difference.

Recommended answer

Yes, you're on the right track with XPath: it is ideal for selecting parts of an XML document.

For example, for this XML,

<r>
   <h2>Title A</h2>
   <div>Some Content</div>
   <div>More Content</div>
   <h2>Title B</h2>
</r>

this XPath,

//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']

will select this content,

<div>Some Content</div>
<div>More Content</div>

between the two h2 titles, as requested.
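To try the expression outside an XPath tester, here is a minimal sketch using Java's standard `javax.xml.xpath` API. The class name `BetweenHeadings` and the helper `selectText` are hypothetical names chosen for illustration, and the sketch assumes the input is well-formed XML. The question mentions C#; the same XPath 1.0 expression should carry over to the XPath support there, and Java is used here only as an example host language.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BetweenHeadings {
    // The sample document from the answer, written as one string.
    static final String XML =
        "<r>"
        + "<h2>Title A</h2>"
        + "<div>Some Content</div>"
        + "<div>More Content</div>"
        + "<h2>Title B</h2>"
        + "</r>";

    // Hypothetical helper: returns the text content of every node
    // selected by the given XPath expression, in document order.
    static List<String> selectText(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xpath =
            "//div[preceding-sibling::h2 = 'Title A'"
            + " and following-sibling::h2 = 'Title B']";
        // prints [Some Content, More Content]
        System.out.println(selectText(XML, xpath));
    }
}
```

Both div elements are returned because each one has an h2 sibling before it whose text is 'Title A' and an h2 sibling after it whose text is 'Title B'.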

Update to address the OP's self-answer:

For this new XML example,

<div>
    <h2><span>Summary</span></h2>
    <p>Paragraph</p>
    <ul>
        <li>List1</li>
        <li>List2</li>
        <li>List3</li>
    </ul>
    <p>Paragraph</p>

    <h2><span>Location</span></h2>
    <p>Paragraph</p>
</div>

the XPath I provided above can easily be adapted,

//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']

to select this XML,

<p>Paragraph</p>  
<ul>
   <li>List1</li>
   <li>List2</li>
   <li>List3</li>
</ul>    
<p>Paragraph</p>

as requested.
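The adapted expression can be checked the same way. The sketch below is an illustration only: `SectionBetween` and `elementNamesBetween` are hypothetical names, the heading titles are assumed to contain no apostrophes (they are spliced into the XPath string as-is), and the input is assumed to be well-formed XML. Note that `h2 = 'Summary'` still matches even though the text sits inside a nested span, because XPath compares against the string-value of the h2 element, which includes the text of its descendants.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SectionBetween {
    // The second sample document from the answer.
    static final String XML =
        "<div>"
        + "<h2><span>Summary</span></h2>"
        + "<p>Paragraph</p>"
        + "<ul><li>List1</li><li>List2</li><li>List3</li></ul>"
        + "<p>Paragraph</p>"
        + "<h2><span>Location</span></h2>"
        + "<p>Paragraph</p>"
        + "</div>";

    // Hypothetical helper: names of the elements that sit between the
    // h2 titled startTitle and the h2 titled endTitle.
    // Assumption: neither title contains an apostrophe.
    static List<String> elementNamesBetween(String xml, String startTitle, String endTitle) {
        String xpath = String.format(
            "//*[preceding-sibling::h2 = '%s' and following-sibling::h2 = '%s']",
            startTitle, endTitle);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints [p, ul, p]
        System.out.println(elementNamesBetween(XML, "Summary", "Location"));
    }
}
```

The final paragraph after the Location heading is not selected, since it has no following h2 sibling. The li elements are also excluded, because their only siblings are other li elements.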
