Extract a single string from HTML using Ruby/Mechanize (and Nokogiri)


Question

I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, but I cannot get it to work. I used FireXPath to determine the XPath.

Sample code:

require 'rubygems'
require 'mechanize'

post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')

All my attempts end with an empty string or an error.

I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:

After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.

But what methods? Where can I read about them, with samples and an explanation of the syntax? I did not find anything on Nokogiri's site either.

Recommended answer

Radek, I'm going to show you how to fish.

When you call Mechanize::Page#parser, it gives you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your XPath expressions. In general, start with the most general XPath you can get to work, and then narrow it down. So, for example, instead of this:

puts  post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets every table, anywhere in the document, and prints them as HTML. Examine the HTML to see which tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back nothing (an empty result instead of an array of nodes), you goofed up the XPath, so fix it before proceeding. Once you're getting the table you want, try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, take off the "to_html" and you now have an array of Nokogiri nodes (a Nokogiri::XML::NodeSet), each one a table row.

And that's how you do it.
