Extracting between <br> tags with Nokogiri?


Question

I am trying to extract the phone number and the address from this site using Nokogiri. Both of them are between <br> tags. How can I do this?


In case the site is down, here is an excerpt of some of the HTML from which I wish to extract the phone number and address:

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Alana's Cafe</strong><br>
<em>Cafe/Desserts </em>
<br>
650 348-0417
<br>
1408 Burlingame Ave
<br>
<a href="http://www.alanascafe.com/burlingame.html" target="_blank">http://www.alanascafe.com/burlingame.html</a>

</td><td align="right">
<a href="index.cfm?vid=44885" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Amber Moon Indian Restaurant and Bar</strong><br>
<em>Indian </em>

<br>
1425 Burlingame Ave


</td><td align="right">
<a href="index.cfm?vid=44872" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

Solution

Simplest would be something like:

data = doc.search('em').map{|em| em.search('~ br').map{|br| br.next.text.strip}}
#=> [["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"], etc...

That means: for each em element, find every br that follows it as a sibling, and collect the stripped text of the node that comes right after each of those br elements.

Update

To sort that into phone / address you could do:

data.map{|row| {:phone => row[0][/^[\d \(\)-]+$/] ? row.shift : nil, :address => row.shift}}
#=> [{:phone=>"650 348-0417", :address=>"1408 Burlingame Ave"}, etc...
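The one-liner above relies on `row.shift`, so it empties the rows in `data` as it goes. A non-mutating variant might look like the sketch below; `PHONE_PATTERN` and `classify` are hypothetical names, and the sample rows are copied from the output shown earlier. The assumption is the same as in the answer: a first entry made up only of digits, spaces, parentheses, and hyphens is a phone number, otherwise the listing has no phone and the first entry is the address.

```ruby
# Digits, spaces, parentheses and hyphens only (same character set as the answer's regexp).
PHONE_PATTERN = /\A[\d ()\-]+\z/

# Hypothetical helper: split one row into phone and address without mutating it.
def classify(row)
  # If the first entry is not a phone number, pad the row with nil in the phone slot.
  phone, address = row.first.to_s.match?(PHONE_PATTERN) ? row : [nil, *row]
  { phone: phone, address: address }
end

rows = [
  ["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"],
  ["1425 Burlingame Ave"]
]

result = rows.map { |row| classify(row) }
p result
```

Any entries after the address (such as the website URL) are ignored here, as in the original one-liner.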
