How can I extract the node names for a fragmented XML document using Ruby?
Question
I have an XML-like document which is pre-processed by a system outside my control. The format of the document is like this:
<template>
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
</template>
However, I only receive what is between the <template> tags, as a text string.
I would like to be able to extract the tags without specifying them ahead of time when parsing. I can do this with the Crack gem, but only if there is a single tag and it is at the end of the string.
With Crack, I can pass in a string like
string = "<SETPROFILE><NAME>email</NAME><VALUE>go@go.com</VALUE></SETPROFILE>"
and my output from Crack is:
{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go@go.com"}}
Then I can use a case statement for the possible values I care about.
Given that the string needs to contain multiple <tags>, and they cannot be at the end of the string, how can I easily parse out the node names and values, similar to what I do with Crack?
These tags also need to be removed. I would like to continue to use the excellent suggestion from @TinMan.
It works perfectly once I know the name of the tag. The number of tags will be finite. Once I know a tag's name I send it to the appropriate method, but it needs to be parsed out easily first.
Recommended answer
Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
EOT
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n] = n.text
}
nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}
Or, more clearly:
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}
Getting the content of a particular tag is then easy:
nodes['RECALL'] # => "first_name"
Iterating over all the tags is also easy:
nodes.keys.each do |k|
...
end
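The loop body is left open above. As a rough sketch of how it might tie back to the case statement mentioned in the question, the following dispatches on the tag name. The hash is written out literally from the output shown earlier so the example runs without Nokogiri, and the `handle_recall` and `handle_setprofile` methods are hypothetical placeholders:

```ruby
# Tag names and values, copied from the hash produced earlier.
nodes = {
  'RECALL' => 'first_name', 'SETPROFILE' => 'email',
  'NAME' => 'email', 'VALUE' => '', 'star' => ''
}

# Hypothetical handlers; real ones would do the actual work.
def handle_recall(value)
  "recall:#{value}"
end

def handle_setprofile(value)
  "setprofile:#{value}"
end

# Send each known tag to its handler and collect the results.
handled = nodes.each_with_object([]) do |(name, value), results|
  case name
  when 'RECALL'     then results << handle_recall(value)
  when 'SETPROFILE' then results << handle_setprofile(value)
  # Other tags (NAME, VALUE, star) are ignored in this sketch.
  end
end
```

Since the number of tags is finite, a case statement like this stays manageable; a hash mapping tag names to method names would be another option.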
You can even replace a tag and its content with text:
doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred. Thanks for giving me your email. \n<SETPROFILE>\n <NAME>email</NAME>\n <VALUE>\n <star/>\n </VALUE>\n</SETPROFILE>. I have just sent you something.\n"
How to replace the nested tags is left to you as an exercise.