How do I parse an HTML table with Nokogiri?
Question
I installed Ruby and Mechanize. It seems to me that Nokogiri can do what I want, but I do not know how to do it.

What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread, such as: Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The name will always be the same (I hope). Can I use the tbody and the name in the code?

<table >
  <tbody>
    <tr> <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
        06 Jan 2010 <span class="time">23:35</span><br />
        by <a href="member.php?find=lastposter&t=230708">shane943</a>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>

Answer
#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)

# Select only the rows inside the tbody with the wanted id.
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

details = rows.collect do |row|
  detail = {}
  # Each XPath below is evaluated relative to the current row.
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    # at_xpath returns nil when nothing matches; nil.to_s is "".
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]