Ruby Mechanize screen scraping help


Question

I am trying to scrape one row from a table in which each row starts with a date. I only want the third row, which has today's date.

This is my Mechanize code. I am trying to select the row that has today's date, along with its columns:

agent.page.search("//td").map(&:text).map(&:strip)

Output:
"11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK",
"12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK",
"14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK",
"7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"

I only want to scrape the third row, the one with today's date.

Recommended answer

Rather than looping over every <td> tag using '//td', search for the <tr> tags, grab only the third one, then loop over the <td> cells inside that row.

Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:

html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

pp doc.search('//tr')[2].search('td').map{ |n| n.text }

>> ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]

Append .search('//tr')[2].search('td').map{ |n| n.text } to Mechanize's agent.page, like so:

agent.page.search('//tr')[2].search('td').map{ |n| n.text }

It's been a while since I played with Mechanize, so it might also be agent.page.parser....
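
The hard-coded index [2] only works while today's row happens to be the third one. As a rough sketch of selecting by date instead, the snippet below assumes the cells have already been collected per row as arrays of strings (for instance by mapping over the '//tr' results as shown above) and that the first cell holds the date as DD-MM-YYYY, as in the sample output. SAMPLE_ROWS and row_for_date are hypothetical names used only for illustration; they are not part of Mechanize or Nokogiri.

```ruby
require 'date'

# Stand-in for scraped data: one array of cell texts per <tr>.
# In real use this would come from the page, not a literal.
SAMPLE_ROWS = [
  ["11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK"],
  ["12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK"],
  ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]
].freeze

# Hypothetical helper: returns the first row whose first cell equals
# the given date formatted as DD-MM-YYYY, or nil if no row matches.
def row_for_date(rows, date = Date.today)
  wanted = date.strftime('%d-%m-%Y')
  rows.find { |cells| cells.first.to_s.strip == wanted }
end

# A fixed date is used here so the example does not depend on the clock.
p row_for_date(SAMPLE_ROWS, Date.new(2011, 2, 14))
```

Calling row_for_date(rows) without a date falls back to Date.today, which is what the question asks for. It returns nil when no row carries that date, so the caller should check for that case.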

There will be more rows in the table. The row that I want to scrape is always the second-to-last one.

It's important to put that information into your original question. The more accurate your question, the more accurate our answers.
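
Given that follow-up, a negative index is one way to pick the second-to-last row regardless of how many rows the table grows to. The sketch below works on plain Ruby arrays of cell texts, one per row, with the totals row from the sample output as the last entry. TABLE_ROWS and second_to_last_row are hypothetical names for illustration. With Mechanize, something like agent.page.search('//tr').to_a[-2] should behave the same way, though that line is an untested assumption here.

```ruby
# Stand-in for scraped data: the dated rows followed by the totals row,
# which has no date cell in the sample output.
TABLE_ROWS = [
  ["11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK"],
  ["12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK"],
  ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"],
  ["7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"]
].freeze

# Hypothetical helper: index -2 counts from the end of the array, so this
# returns the row just before the last one, or nil if there are fewer
# than two rows.
def second_to_last_row(rows)
  rows[-2]
end

p second_to_last_row(TABLE_ROWS)
```

This relies on the page layout staying the same, with exactly one totals row after the dated rows. Matching on the date itself is less fragile if the layout may change.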
