Use XPath to group siblings from an HTML/XML document?
Question
I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes.
For example, I want to take the following fragment:
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
Into this:
<section>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</section>
<section>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</section>
Is this possible using simple XPath selectors and an XML parser like Nokogiri? Or do I need to implement a SAX parser for this task?
Solution
Updated Answer
Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:
require 'nokogiri'

class Nokogiri::XML::Node
  # Create a hierarchy on a document based on heading levels
  #   wrap   : e.g. "<section>" or "<div class='section'>"
  #   stops  : array of tag names that stop all sections; use nil for none
  #   levels : array of tag names that control nesting, in order
  def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
    levels = Hash[ levels.zip(0...levels.length) ]
    stops  = stops && Hash[ stops.product([true]) ]
    stack  = []
    children.each do |node|
      unless level = levels[node.name]
        level = stops && stops[node.name] && -1
      end
      stack.pop while (top=stack.last) && top[:level]>=level if level
      stack.last[:section].add_child(node) if stack.last
      if level && level >=0
        section = Nokogiri::XML.fragment(wrap).children[0]
        node.replace(section); section << node
        stack << { :section=>section, :level=>level }
      end
    end
  end
end
Here is this code in use, and the result it gives.
The original HTML
<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>
<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>
<hr /><p id="footer">Copyright © All Done</p>
</body>
The conversion code
# `html` is a String holding the markup shown above.
# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html, &:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section           # make the magic happen
puts doc.to_xhtml                     # show the result with indentation
The result
<body>
  <section>
    <h1>Main Section 1</h1>
    <p>Intro</p>
    <section>
      <h2>Subhead 1.1</h2>
      <p>Meat</p>
      <p>MOAR MEAT</p>
    </section>
    <section>
      <h2>Subhead 1.2</h2>
      <p>Meat</p>
      <section>
        <h3>Caveats</h3>
        <p>FYI</p>
        <section>
          <h4>ProTip</h4>
          <p>Get it done</p>
        </section>
      </section>
    </section>
    <section>
      <h2>Subhead 1.3</h2>
      <p>Meat</p>
    </section>
  </section>
  <section>
    <h1>Main Section 2</h1>
    <section>
      <h3>Jumpin' in it!</h3>
      <p>Level skip!</p>
    </section>
    <section>
      <h2>Subhead 2.1</h2>
      <p>Back up...</p>
      <section>
        <h4>Dive! Dive!</h4>
        <p>...and down</p>
      </section>
    </section>
  </section>
  <hr />
  <p id="footer">Copyright All Done</p>
</body>
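The level-skip nesting in the result (the h3 placed directly under "Main Section 2") follows from the stack that auto_section maintains. As a rough illustration, the sketch below replays that stack logic on a plain list of tag names, with no Nokogiri involved. Note that nest_by_level is a hypothetical helper introduced here only for illustration, and nested arrays stand in for the section elements.

```ruby
# Hypothetical helper (not part of the answer): mirrors the stack logic of
# auto_section on a flat list of tag names and returns nested arrays.
def nest_by_level(tags, stops: %w[hr], levels: %w[h1 h2 h3 h4 h5 h6])
  rank  = levels.each_with_index.to_h  # "h1" => 0, "h2" => 1, ...
  root  = []                           # top-level output
  stack = []                           # currently open sections, outermost first

  tags.each do |tag|
    # Headings get their rank, stop tags get -1, everything else gets nil.
    level = rank[tag] || (Array(stops).include?(tag) ? -1 : nil)

    # A heading or stop closes every open section at the same or a deeper level.
    stack.pop while level && stack.any? && stack.last[:level] >= level

    parent = stack.empty? ? root : stack.last[:items]
    if level && level >= 0
      # A heading opens a new section that contains the heading itself.
      section = { level: level, items: [tag] }
      parent << section[:items]
      stack << section
    else
      # Ordinary nodes and stop tags are appended to whatever is open.
      parent << tag
    end
  end
  root
end

p nest_by_level(%w[h1 p h3 p h2 p h4 p hr p])
# => [["h1", "p", ["h3", "p"], ["h2", "p", ["h4", "p"]]], "hr", "p"]
```

The h3 is nested one level under the h1 even though h2 was skipped, the later h2 closes it, and the hr empties the stack so the trailing p stays at the top level.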
Original Answer
Here's an answer that uses Nokogiri but no XPath. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
<hr>
<p id='footer'>All done!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
  # Provide a block that returns:
  #   true  - for nodes that should start a new section
  #   false - for nodes that should not start a new section
  #   :stop - for nodes that should stop any current section but not start a new one
  def group_under(name="section")
    group = nil
    element_children.each do |child|
      case yield(child)
      when false, nil
        group << child if group
      when :stop
        group = nil
      else
        group = document.create_element(name)
        child.replace(group)
        group << child
      end
    end
  end
end
doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
  if node.name == 'hr'
    :stop
  else
    %w[h1 h2 h3 h4 h5 h6].include?(node.name)
  end
end
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>
For an XPath-based approach, see the related question "XPath: select all following siblings until another sibling".