hpricot with Firebug's XPath


Question

I'm trying to extract some information from a table-based website with hpricot. I got the XPath from Firebug:

/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr

This doesn't work. Apparently, Firebug's XPath is the path through the rendered HTML, not through the actual HTML served by the site. I read that removing tbody may resolve the problem.

I tried:

/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr

That still doesn't work. I did a little more research, and some people report that they got their XPath working by removing the numeric indexes, so I tried this:

/html/body/div/table/tr/td/table/tr/td/table/tr/td/table/tr/td/table/tr

Still no luck.

So I decided to go step by step, like this:

(doc/"html/body/div/table/tr").each do |aaa|
  (aaa/"td").each do |bbb|
    pp bbb
    (bbb/"table/tr").each do |ccc|
      pp ccc
    end
  end
end

I find the information I need in bbb, but not in ccc.

What am I doing wrong? Or is there a better tool for scraping HTML with long, complex XPaths?

Recommended answer

Your problem is in XPather (or Firebug's XPath). I think Firefox internally fixes up badly formatted tables so that they have a tbody element even if the HTML has none. Nokogiri does not do that; it allows a tr tag to sit directly inside table.

So there's a good chance that, to Nokogiri, your path looks like this:

/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr

and that's how Nokogiri will accept it :)
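If you don't want to edit the path by hand, here is a minimal sketch of the conversion in plain Ruby. Both helpers (strip_tbody and first_failing_prefix) are hypothetical names made up for this example, not part of hpricot or Nokogiri, and the sketch assumes an absolute path with single slashes. It also assumes the served HTML really has no tbody tags; if the site does write them, stripping them will break the path instead.

```ruby
# Hypothetical helpers for illustration; not part of hpricot or Nokogiri.

# Drop every `tbody` step (with or without a positional predicate such as
# `tbody[1]`) from an absolute XPath copied out of Firebug.
def strip_tbody(xpath)
  xpath.split('/').reject { |step| step.match?(/\Atbody(\[\d+\])?\z/) }.join('/')
end

# Yield successively longer prefixes of the path and return the first one
# for which the block reports no match, or nil if every prefix matches.
def first_failing_prefix(xpath)
  steps = xpath.split('/').reject(&:empty?)
  (1..steps.length).each do |n|
    prefix = '/' + steps.first(n).join('/')
    return prefix unless yield(prefix)
  end
  nil
end

firebug_path = '/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td'
source_path  = strip_tbody(firebug_path)
puts source_path # prints /html/body/div/table/tr/td/table/tr[2]/td

# With a parsed Nokogiri document `doc` (assumed to exist), the probe could be:
#   first_failing_prefix(source_path) { |prefix| doc.xpath(prefix).any? }
```

The second helper does the same thing as the nested loops in the question, just mechanically: it tells you which step of the path is the first one that matches nothing, so you know where the parser's tree and Firebug's tree diverge.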

You may want to take a look at this:

require 'open-uri'
require 'nokogiri'

class String
  def relative_to(base)
    (base == self[0..base.length-1]) &&
      self[base.length..-1]
  end
end

module Importer
  module XUtils
    module_function

    def match(text, source)
      case text
      when String
        source.include? text
      when Regexp
        text.match(source)
      when Array
        text.all? {|tt| source.include?(tt)}
      else
        false
      end
    end

    def find_xpath(doc, start, texts)
      xpath = start
      found = true

      while(found)
        found = [:inner_html, :inner_text].any? do |m|
          doc.xpath(xpath+"/*").any? do |tag|
            # Replace runs of non-breaking spaces (U+00A0) with a single space.
            tag_text = tag.send(m).strip.gsub(/\u00A0+/, ' ')
            if tag_text && Array(texts).all?{|text| match(text, tag_text)}
              xpath = tag.path.to_s
            end
          end
        end
      end

      (xpath != start) && xpath
    end

    def fetch(url)
      Nokogiri::HTML(URI.open(url).read)
    end
  end
end

I wrote this little module to help me work with Nokogiri when web scraping and data mining.

Basic usage:

 include Importer::XUtils
 doc = fetch("http://some.url.here") # the http:// prefix is important!

 # When you pass an array, it finds the element containing ALL of the strings.
 base = find_xpath(doc, '/html/body', ["what to find1", "What to find 2"])

 precise = find_xpath(doc, base, ["what to find1"])
 precise.relative_to base
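To make the last line less mysterious, here is a small sketch of what relative_to returns. The two paths are made-up stand-ins for what find_xpath might hand back, and the String patch is repeated only so the snippet runs on its own.

```ruby
# Repeated from the module above so this snippet is self-contained.
class String
  def relative_to(base)
    (base == self[0..base.length-1]) &&
      self[base.length..-1]
  end
end

# Hypothetical paths standing in for the results of find_xpath.
base    = '/html/body/div/table'
precise = '/html/body/div/table/tr[3]/td[2]'

# The part of `precise` that lies below `base`.
puts precise.relative_to(base) # prints /tr[3]/td[2]

# When `base` is not a prefix of the string, the result is false.
puts precise.relative_to('/html/head').inspect # prints false
```

Keep in mind that find_xpath itself returns false when nothing matches, so in real code it is probably worth checking base and precise before calling relative_to on them.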

Good luck.
