hpricot with Firebug's XPath
Question
I'm trying to extract some info from a table-based website with hpricot. I got the XPath from Firebug:
/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr
This doesn't work... Apparently, Firebug's XPath is the path through the rendered HTML, not the actual HTML served by the site. I read that removing tbody may resolve the problem.
So I tried:
/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr
It still doesn't work... I did a little more research, and some people report that they got their XPath working by removing the numeric indexes, so I tried this:
/html/body/div/table/tr/td/table/tr/td/table/tr/td/table/tr/td/table/tr
Still no luck...
So I decided to go step by step, like this:
(doc/"html/body/div/table/tr").each do |aaa|
  (aaa/"td").each do |bbb|
    pp bbb
    (bbb/"table/tr").each do |ccc|
      pp ccc
    end
  end
end
I find the info I need in bbb, but not in ccc.
What am I doing wrong? Or is there a better tool for scraping HTML with long, complex XPaths?
Recommended answer
Your problem is in XPather (or Firebug's XPath). I think Firefox internally fixes badly formatted tables so that they have a tbody element even if the HTML has none. Nokogiri does not do that; instead, it allows a tr tag directly inside table.
So there's a good chance your path looks like this to Nokogiri:
/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr
And that's how Nokogiri will accept it :)
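The tbody removal described above can be done mechanically rather than by hand. The following is only a minimal sketch: `strip_tbody` is a hypothetical helper name, not part of Nokogiri or hpricot, and it assumes the raw HTML really contains no tbody elements.

```ruby
# Hypothetical helper (not part of Nokogiri or hpricot): drop every
# `tbody` step, with or without a numeric index, from an XPath string.
def strip_tbody(xpath)
  xpath.split('/').reject { |step| step =~ /\Atbody(\[\d+\])?\z/ }.join('/')
end

# Shortened example path in the style Firebug produces.
firebug_path = '/html/body/div/table/tbody/tr[2]/td/table/tbody/tr'
puts strip_tbody(firebug_path)
# prints /html/body/div/table/tr[2]/td/table/tr
```

The result can then be passed to `doc.xpath`. If the page's source does include explicit tbody tags, stripping them would break the path instead, so it is worth checking the raw HTML first.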
You might want to take a look at this:
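require 'open-uri'
require 'nokogiri'

class String
  # Returns the part of self that follows base,
  # or false if self does not start with base.
  def relative_to(base)
    (base == self[0..base.length-1]) &&
      self[base.length..-1]
  end
end

module Importer
  module XUtils
    module_function

    # Truthy if source contains the String, matches the Regexp,
    # or contains every element of the Array.
    def match(text, source)
      case text
      when String
        source.include? text
      when Regexp
        text.match(source)
      when Array
        text.all? {|tt| source.include?(tt)}
      else
        false
      end
    end

    # Starting at the XPath `start`, keeps descending into a child element
    # whose content matches all of `texts`. Returns the deepest matching
    # path, or false if nothing below `start` matched.
    def find_xpath(doc, start, texts)
      xpath = start
      found = true
      while(found)
        found = [:inner_html, :inner_text].any? do |m|
          doc.xpath(xpath+"/*").any? do |tag|
            # \302\240 is the UTF-8 byte sequence for a non-breaking space
            tag_text = tag.send(m).strip.gsub(/[\302\240]+/, ' ')
            # Array() lets callers pass a single String or Regexp as well
            if tag_text && Array(texts).all?{|text| match(text, tag_text)}
              xpath = tag.path.to_s
            end
          end
        end
      end
      (xpath != start) && xpath
    end

    def fetch(url)
      # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
      Nokogiri::HTML(URI.open(url).read)
    end
  end
end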
I wrote this little module to help me work with Nokogiri when web scraping and data mining.
Basic usage:
include Importer::XUtils
doc = fetch("http://some.url.here") # http:// is important!
# when you provide an array, it finds the element containing ALL of the strings
base = find_xpath(doc, '/html/body', ["what to find1", "What to find 2"])
precise = find_xpath(doc, base, "what to find1")
precise.relative_to base
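If a hand-written path still returns nothing, one possible way to see where it stops matching is to test each prefix of the path in turn. This is only a sketch: `first_failing_prefix` is a hypothetical helper, and the block you pass it decides what counts as a match, so it is not tied to any particular parser.

```ruby
# Hypothetical helper: returns the shortest prefix of an absolute XPath for
# which the given block reports no match, or nil if every prefix matches.
def first_failing_prefix(xpath)
  steps = xpath.split('/').reject(&:empty?)
  (1..steps.length).each do |n|
    prefix = '/' + steps.first(n).join('/')
    return prefix unless yield(prefix)
  end
  nil
end

# Stand-in matcher for illustration: pretend only these paths exist.
# With a Nokogiri document the block could be { |p| doc.xpath(p).any? }.
known = ['/html', '/html/body', '/html/body/table']
result = first_failing_prefix('/html/body/table/tbody/tr') { |p| known.include?(p) }
puts result
# prints /html/body/table/tbody
```

Here the first failing step is the tbody that only exists in the browser's rendered DOM, which is the situation described in the answer.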
Good luck!