XPath to get markup between two headings
Question
I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I assumed I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered it wouldn't be that easy. The best way to separate the content once I have the page seems to be to select what's between two sets of h2 tags.
Example:
<h2>Title</h2> <div>Some Content</div> <h2>Title</h2>
Here I would want to get the div between the two headers. I tried doing this with XPath, but with no luck at all. I am going to look further into XPath because I think that's what I need to achieve what I want, but before I dig too deep, I would like to hear what you think. Is XPath the right way to go, or do I have other, easier options? I am writing the application in C#, if that makes any difference.
Recommended answer
Yes, you're on the right track with XPath; it's ideal for selecting parts of an XML document.
For example, for this XML,
<r>
<h2>Title A</h2>
<div>Some Content</div>
<div>More Content</div>
<h2>Title B</h2>
</r>
this XPath,
//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']
will select this content,
<div>Some Content</div>
<div>More Content</div>
between the two h2 titles, as requested.
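For readers who want to check the selection logic without an XPath 1.0 engine at hand, here is a minimal Python sketch. It is an illustration under assumptions, not part of the original answer: Python's standard xml.etree.ElementTree implements only a subset of XPath and has no preceding-sibling or following-sibling axes, so the sketch walks the children of r in document order instead of evaluating the expression. The helper name elements_between is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<r>
<h2>Title A</h2>
<div>Some Content</div>
<div>More Content</div>
<h2>Title B</h2>
</r>"""


def elements_between(parent, start_title, end_title, tag="div"):
    """Return children named `tag` that come after the h2 whose text is
    start_title and before the h2 whose text is end_title.

    Hypothetical helper: it emulates the sibling-axis predicate by hand,
    because ElementTree does not support those axes.
    """
    selected = []
    inside = False
    found_end = False
    for child in parent:  # children are iterated in document order
        if child.tag == "h2" and child.text == start_title:
            inside = True
        elif child.tag == "h2" and child.text == end_title:
            found_end = True
            break
        elif inside and child.tag == tag:
            selected.append(child)
    # Like the XPath predicate, select nothing if the closing heading is absent.
    return selected if found_end else []


root = ET.fromstring(XML)
result = [el.text for el in elements_between(root, "Title A", "Title B")]
print(result)
```

On this sample the sketch yields the same two div elements as the XPath. In C#, or any environment with full XPath 1.0 support, the expression can be evaluated directly and no manual walk is needed.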
Update to address the OP's self-answer:
For this new XML example,
<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>
the XPath I provided above can easily be adapted,
//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']
to select this XML,
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
as requested.
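One detail worth spelling out is why h2 = 'Summary' matches even though the text sits inside a span: XPath compares the string-value of the h2 element, which is the concatenation of all of its descendant text nodes. The Python sketch below illustrates that point. As before, it is an assumption-based illustration rather than part of the original answer; it uses the standard xml.etree.ElementTree module, the helper names string_value and section_between are hypothetical, and the sibling walk stands in for the XPath axes that ElementTree does not implement.

```python
import xml.etree.ElementTree as ET

XML = """<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>"""


def string_value(element):
    """Concatenate all descendant text, mirroring the XPath string-value."""
    return "".join(element.itertext())


def section_between(parent, start_title, end_title):
    """Return every child element located between the two h2 headings
    (hypothetical helper emulating the sibling-axis predicate)."""
    selected = []
    inside = False
    found_end = False
    for child in parent:  # document order
        if child.tag == "h2" and string_value(child) == start_title:
            inside = True
        elif child.tag == "h2" and string_value(child) == end_title:
            found_end = True
            break
        elif inside:
            selected.append(child)
    return selected if found_end else []


root = ET.fromstring(XML)
first_h2 = root.find("h2")
# The heading has no direct text; the text lives in the nested span.
direct_text = first_h2.text
full_text = string_value(first_h2)
tags = [el.tag for el in section_between(root, "Summary", "Location")]
print(direct_text, full_text, tags)
```

Comparing only the direct text of the h2 would miss the heading entirely here, which is why the comparison is made against the full string-value.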