How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object


Problem description

When an indented XML document is parsed, non-significant white space text nodes are created from the white space between a closing tag and the next opening tag. For example, from the following XML:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

whose string representation is as follows,

 "<note>\n  <to>Tove</to>\n  <from>Jani</from>\n  <heading>Reminder</heading>\n  <body>Don't forget me this weekend!</body>\n</note>\n"

the following Document is created:

#(Document:0x3fc07e4540d8 {
  name = "document",
  children = [
    #(Element:0x3fc07ec8629c {
      name = "note",
      children = [
        #(Text "\n  "),
        #(Element:0x3fc07ec8089c {
          name = "to",
          children = [ #(Text "Tove")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d8064 {
          name = "from",
          children = [ #(Text "Jani")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d588c {
          name = "heading",
          children = [ #(Text "Reminder")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8cf590 {
          name = "body",
          children = [ #(Text "Don't forget me this weekend!")]
          }),
        #(Text "\n")]
      })]
  })

Here, there are many white space nodes of type Nokogiri::XML::Text.

I would like to count the children of each node in a Nokogiri XML Document, and access the first or last child, excluding non-significant white space. I do not want to parse these nodes, or to have to distinguish between them and significant text nodes such as the "Tove" inside the <to> element. Here is an rspec of what I am looking for:

require 'nokogiri'
require_relative 'spec_helper'

xml_text = <<XML
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
XML

xml = Nokogiri::XML(xml_text)

def significant_nodes(node)
  return 0
end

describe "Stackoverflow Question" do
  it "should return the number of significant nodes in nokogiri." do
    expect(significant_nodes(xml.css('note'))).to eq 4
  end
end

I want to know how to write the significant_nodes function.

If I change the XML to:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
  <footer></footer>
</note>

then when I create the Document, I would still like the footer to be represented; using config.noblanks is not an option.

Recommended answer

You can use the NOBLANKS option when parsing the XML string. Consider this example:

require 'nokogiri'

string = "<foo>\n  <bar>bar</bar>\n</foo>"
puts string
# <foo>
#   <bar>bar</bar>
# </foo>

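# Default parse: whitespace-only text nodes are kept.
document_with_blanks = Nokogiri::XML.parse(string)

# Parse with NOBLANKS: blank text nodes between elements are dropped.
document_without_blanks = Nokogiri::XML.parse(string) do |config|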
  config.noblanks
end

document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>


The NOBLANKS option does not remove empty elements:

doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
  config.noblanks
end

doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">


As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic. The following is a specification of the behaviour of the NOBLANKS option:

require 'rspec/autorun'
require 'nokogiri'

def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "removes whitespace nodes if they have siblings" do
    doc = parse_xml("<root>\n <child></child></root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

  it "doesn't remove whitespaces nodes if they have no siblings" do
    doc = parse_xml("<root>\n </root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
  end

  it "doesn't remove empty nodes" do
    doc = parse_xml('<root><child></child></root>')
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

end
