How can I get Nokogiri to parse and return an XML document?


Question

Here's a sample of some oddness:

#!/usr/bin/ruby

require 'rubygems'
require 'open-uri'
require 'nokogiri'

print "without read: ", Nokogiri(open('http://weblog.rubyonrails.org/')).class, "\n"
print "with read:    ", Nokogiri(open('http://weblog.rubyonrails.org/').read).class, "\n"

Running this returns:

without read: Nokogiri::XML::Document
with read:    Nokogiri::HTML::Document

Without read the result is XML, and with it the result is HTML? The web page is declared as "XHTML transitional", so at first I thought Nokogiri must be reading OpenURI's "content-type" from the stream, but that returns 'text/html':

(rdb:1) doc = open(('http://weblog.rubyonrails.org/'))
(rdb:1) doc.content_type
"text/html"

That is what the server is returning. So now I'm trying to figure out why Nokogiri returns two different classes. It doesn't appear to be parsing the text and using heuristics to determine whether the content is HTML or XML.

The same thing happens with the Atom feed that page points to:

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails').read)
(rdb:1) doc.class
Nokogiri::HTML::Document

I need to be able to parse a page without knowing in advance what it is, either HTML or a feed (RSS or Atom), and reliably determine which it is. I asked Nokogiri to parse the body of either an HTML or an XML feed file, but I'm seeing those inconsistent results.

I thought I could write some tests to determine the type, but then I ran into XPath queries not finding elements while regular searches work:

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc.xpath('/feed/entry').length
0
(rdb:1) doc.search('feed entry').length
15

I figured XPath would work with XML, but the results don't look trustworthy either.

These tests were all done on my Ubuntu box, but I've seen the same behavior on my MacBook Pro. I'd love to find out I'm doing something wrong, but I haven't seen an example for parsing and searching that gave me consistent results. Can anyone show me the error of my ways?

Recommended answer

It has to do with the way Nokogiri's parse method works. Here's the source:

# File lib/nokogiri.rb, line 55
    def parse string, url = nil, encoding = nil, options = nil
      doc =
        if string =~ /^\s*<[^Hh>]*html/i # Probably html
          Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
        else
          Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
        end
      yield doc if block_given?
      doc
    end

The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you pass the result of open directly, Nokogiri receives an IO-like object rather than a String, and the regex test against it never matches, so the XML branch is always taken. read, on the other hand, returns a String, so the regex can match and the content can be treated as HTML. In this case it is, because the string matches that regex. Here's the start of that string:

<!DOCTYPE html PUBLIC
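To see the match concretely, here is a minimal sketch that applies the pattern from the quoted source to a few strings. It uses only plain Ruby, so Nokogiri itself is not involved. The helper name probably_html? and the sample inputs are made up for illustration and are not part of Nokogiri.

```ruby
# The pattern copied from the Nokogiri.parse source quoted above.
PROBABLY_HTML = /^\s*<[^Hh>]*html/i

# Hypothetical helper (not part of Nokogiri): true when the heuristic
# would send the text to the HTML parser.
def probably_html?(text)
  text.match?(PROBABLY_HTML)
end

# Made-up sample inputs for illustration.
SAMPLES = [
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">',
  '<html><body></body></html>',
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<rss version="2.0">'
].freeze

SAMPLES.each do |text|
  puts format('%-5s %s', probably_html?(text), text)
end
```

The first two samples print true and the last two print false, which is why a page starting with an XHTML doctype ends up in the HTML branch once it is passed in as a String.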

The regex matches the "!DOCTYPE " against [^Hh>]* and then matches the "html", thus assuming it's HTML. Why someone selected this regex to determine whether the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but one that begins with <this-is-still-not-html> is considered XML (the "h" in "this" stops [^Hh>]* before it can reach "html"). You're probably best off staying away from this dumb function and calling Nokogiri::HTML::Document.parse or Nokogiri::XML::Document.parse directly.
