Extract data from an HTML table with Mechanize

Question

First of all, here is the sample HTML table:

 <tr>
   <td><strong>Kangchenjunga </strong></td>
   <td>8,586m<br /></td>
   <td>28,169ft</td>
   <td><div align="center">Nepal/India </div></td>
   <td>1955; G. Band, J. Brown </td>
 </tr>

ARGV[0] will hold the name of a mountain (the first column), and the return value should be the last column: the people who climbed the mountain for the first time.

So I need to check whether the first column of a row is ARGV[0] and, if it is, return that row's last column without the date.

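require 'mechanize'

# Mechanize needs an absolute URL, including the scheme.
body = Mechanize.new.get('http://www.alpineascents.com/8000m-peaks.asp').body
# Crude check: does the mountain name appear anywhere in the raw HTML?
if body.include?('<strong>' + ARGV[0])
  puts 'ok'
end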

The code above is what I have so far; it prints "ok" if ARGV[0] appears in the body of the HTML document. How can I get the last column of the same row in which ARGV[0] is found?

Example:

<tr>
 <td><strong>GIVE THIS AS A PARAMETER </strong></td>
 <td>SKIP THIS<br /></td>
 <td>SKIP THIS</td>
 <td><div align="center">SKIP THIS</div></td>
 <td>I WANT IT TO RETURN THIS</td>
</tr>

I'm really new to Ruby.

Recommended answer

A more succinct version that relies more on the black magic of XPath :)

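require 'nokogiri'
require 'open-uri'

# Use URI.open: since Ruby 3.0, Kernel#open no longer opens URLs.
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# Find the row whose <strong> cell matches the name (normalize-space ignores
# padding such as the trailing space in the sample), then take its fifth cell.
last_td = doc.xpath("//tr[td/strong[normalize-space(.)='#{ARGV[0]}']]/td[5]")

# Remove everything up to and including the ';' that follows the year.
puts last_td.text.gsub(/.*?;/, '').strip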
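
The final cleanup step can be tried on its own, without fetching the page. Here is a minimal sketch in plain Ruby, assuming the cell text has the form "year; names" as in the sample row. The helper name strip_year is hypothetical and not part of Nokogiri or Mechanize.

```ruby
# Hypothetical helper (not part of Nokogiri or Mechanize): given the text of
# the last table cell, drop the leading year and return only the names.
# Assumes the cell looks like "year; names", as in the sample row.
def strip_year(cell_text)
  # \A anchors at the start of the string, [^;]*; consumes everything up to
  # and including the first semicolon, and strip trims surrounding whitespace.
  cell_text.sub(/\A[^;]*;/, '').strip
end

puts strip_year('1955; G. Band, J. Brown ')
```

Unlike gsub(/.*?;/, ''), the anchored sub removes only the first "…;" segment. For a made-up cell such as '1955; A; B' it would return "A; B", while gsub would strip every segment ending in a semicolon and leave only " B".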
