Using Nokogiri to Split Content on BR tags
Question
I have a snippet of HTML that I'm trying to parse with Nokogiri. It looks like this:
<td class="j">
<a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
</td>
I have access to the source of the td.j using something like this:
data_items = doc.css("td.j")
My goal is to split each of those lines up into an array of hashes. The only logical splitting point I can see is to split on the BRs and then use some regex on the string.
I was wondering if there's a better way to do this, maybe using Nokogiri only? Even if I could use Nokogiri to pull out the 3 line items, it would make things easier for me, as I could just do some regex parsing on the .content result.
I'm not sure how to use Nokogiri to grab lines ending with br, though. Should I be using XPaths? Any direction is appreciated. Thank you!
Recommended answer
If your data really is that regular and you don't need the attributes from the <a> elements, then you could parse the text form of each table cell without having to worry about the <br> elements at all.
Given some HTML like this in html:
<table>
<tbody>
<tr>
<td class="j">
<a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
</td>
<td class="j">
<a title="title text1" href="http://link4.com">Link 4</a> (info1), Blah 2,<br>
<a title="title text2" href="http://link5.com">Link 5</a> (info1), Blah 2,<br>
<a title="title text2" href="http://link6.com">Link 6</a> (info2), Blah 2 Foo 2,<br>
</td>
</tr>
<tr>
<td class="j">
<a title="title text1" href="http://link7.com">Link 7</a> (info1), Blah 3,<br>
<a title="title text2" href="http://link8.com">Link 8</a> (info1), Blah 3,<br>
<a title="title text2" href="http://link9.com">Link 9</a> (info2), Blah 3 Foo 2,<br>
</td>
<td class="j">
<a title="title text1" href="http://linkA.com">Link A</a> (info1), Blah 4,<br>
<a title="title text2" href="http://linkB.com">Link B</a> (info1), Blah 4,<br>
<a title="title text2" href="http://linkC.com">Link C</a> (info2), Blah 4 Foo 2,<br>
</td>
</tr>
</tbody>
</table>
With that parsed into doc (for example, doc = Nokogiri::HTML(html)), you could do this:
# Strip each match: the cell text keeps the newline that follows every <br>.
chunks = doc.search('.j').map { |td| td.text.strip.scan(/[^,]+,[^,]+/).map(&:strip) }
and have this:
[
[ "Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2" ],
[ "Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2" ],
[ "Link 7 (info1), Blah 3", "Link 8 (info1), Blah 3", "Link 9 (info2), Blah 3 Foo 2" ],
[ "Link A (info1), Blah 4", "Link B (info1), Blah 4", "Link C (info2), Blah 4 Foo 2" ]
]
in chunks. Then you could convert that to whatever hash form you needed.
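As one illustration of that last step, here is a minimal sketch that turns the strings of a single cell into hashes. It assumes every line has the shape "name (info), note" as in the sample data; the helper line_to_hash and the keys name, info, and note are hypothetical names, not part of the original answer.

```ruby
# Sample input: the first cell's strings, copied from the output shown above.
chunk = [
  "Link 1 (info1), Blah 1",
  "Link 2 (info1), Blah 1",
  "Link 3 (info2), Blah 1 Foo 2"
]

# Hypothetical helper: split "name (info), note" into a hash.
# Returns nil when a line does not follow that shape.
def line_to_hash(line)
  pattern = /\A(?<name>[^(]+?)\s*\((?<info>[^)]*)\),\s*(?<note>.+)\z/
  match = pattern.match(line.strip)
  return nil unless match

  { name: match[:name], info: match[:info], note: match[:note] }
end

# Drop lines that did not match instead of keeping nil entries.
rows = chunk.map { |line| line_to_hash(line) }.compact

rows.each { |row| puts row.inspect }
```

To process every cell, the same helper could be applied to each inner array of chunks, for example with chunks.map { |cell| cell.map { |line| line_to_hash(line) }.compact }.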