Nokogiri and Xpath: find all text between two tags
Problem description
I'm not sure if it's a matter of syntax or differences in versions, but I can't seem to figure this out. I want to take the data that is inside a (non-closing) td, from the h2 tag to the h3 tag. Here is what the HTML looks like:
<td valign="top" width="350">
<br><h2>NameIWant</h2><br>
<br>Town<br>
PhoneNumber<br>
<a href="mailto:emailIwant@nowhere.com" class="links">emailIwant@nowhere.com</a>
<br>
<a href="http://websiteIwant.com" class="links">websiteIwant.com</a>
<br><br>
<br><img src="images/spacer.gif"/><br>
<h3><b>I want to stop before this!</b></h3>
Lorem Ipsum Yadda Yadda<br>
<img src="images/spacer.gif" border="0" width="20" height="11" alt=""/><br>
<td width="25">
<img src="images/spacer.gif" border="0" width="20" height="8" alt=""/>
<td valign="top" width="200"><img src="images/spacer.gif"/>
<br>
<br>
<table cellspacing="0" cellpadding="0" border="0" width="205"><tr><td>
<a href="http://dontneedthis.com">
</a></td></tr><br>
<table border="0" cellpadding="3" cellspacing="0" width="200">
...
The <td valign> doesn't close until the very bottom of the page, which I think might be why I'm having problems.
My Ruby code looks like:
require 'open-uri'
require 'nokogiri'
@doc = Nokogiri::XML(open("http://www.url.com"))
content = @doc.css('//td[valign="top"] [width="350"]')
name = content.xpath('//h2').text
puts name # Returns NameIwant
townNumberLinks = content.search('//following::h2')
puts content # Returns <h2> NameIWant </h2>
As I understand it, the following axis should select "everything in the document after the closing tag of the current node". If I try to use preceding instead, like:
townNumberLinks = content.search('//preceding::h3')
# I get: <h3><b>I want to stop before this!</b></h3>
Hope I made it clear what I'm trying to do. Thanks!
It's not trivial. In the context of the nodes you selected (the td), to get everything between two elements, you need to perform an intersection of these two sets:
- Set A: All the nodes preceding the first h3: //h3[1]/preceding::node()
- Set B: All the nodes following the first h2: //h2[1]/following::node()
To perform an intersection, you can use the Kaysian method (after Michael Kay, who proposed it). The basic formula is:
A[count(.|B) = count(B)]
Applying it to your sets, as defined above, where A = //h3[1]/preceding::node() and B = //h2[1]/following::node(), we have:
//h3[1]/preceding::node()[ count( . | //h2[1]/following::node()) = count(//h2[1]/following::node()) ]
which will select all elements and text nodes starting with the first <br> after the </h2> tag, to the whitespace text node after the last <br>, just before the next <h3> tag.
You can easily select just the text nodes between h2 and h3 by replacing node() with text() in the expression. This one will return all text nodes (including whitespace and linebreaks) between the two headers:
//h3[1]/preceding::text()[ count( . | //h2[1]/following::text()) = count(//h2[1]/following::text()) ]