Nokogiri leaving HTML entities untouched

Question

I want Nokogiri to leave HTML entities untouched, but it seems to convert them into the actual characters. For example:

 Nokogiri::HTML.fragment('<p>&reg;</p>').to_s

Result: "<p>®</p>"

Nothing seems to give me the original HTML back. The .inner_html, .text, and .content methods all return '®' instead of '&reg;'.

Is there a way to make Nokogiri leave these HTML entities untouched?

I've already searched Stack Overflow and found similar questions, but nothing exactly like this one.

Recommended answer

Not an ideal answer, but you can force Nokogiri to emit entities (though not the friendly named ones) by restricting the output encoding:

#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>&reg;</p>')
puts html.to_html                              #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' )       #=> <p>&#174;</p>

It would be nice if Nokogiri used the 'nice' entity names where they are defined, instead of always emitting a terse numeric character reference (decimal, as in &#174;), but even that would not be 'preserving' the original entity.
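If friendlier names matter more than preserving the exact original, one option is to post-process the ASCII output yourself. The following is only a minimal sketch, not something Nokogiri provides: the NAMED_ENTITIES table and the name_entities helper are hypothetical names introduced here, and the table is deliberately tiny.

```ruby
# Hypothetical, deliberately tiny lookup table: code point => entity name.
NAMED_ENTITIES = { 169 => 'copy', 174 => 'reg', 8482 => 'trade' }.freeze

# Hypothetical helper: rewrite decimal character references that have a
# known name; anything not in the table is left exactly as it was.
def name_entities(html)
  html.gsub(/&#(\d+);/) do
    whole = Regexp.last_match(0)
    name  = NAMED_ENTITIES[Regexp.last_match(1).to_i]
    name ? '&' + name + ';' : whole
  end
end

puts name_entities('<p>&#174; and &#9731;</p>')
```

The input string stands in for what to_html( encoding:'US-ASCII' ) would produce. This still does not recover which form the author originally typed; it only picks one spelling consistently.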

The root of the problem is that, in HTML, the following all describe exactly the same content:

<p>®</p>
<p>&reg;</p>
<p>&#xAE;</p>  
<p>&#174;</p>
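As a quick sanity check that the two numeric forms really name the same character, here is a small plain-Ruby sketch (no Nokogiri involved) that compares the decimal and hexadecimal values against the code point of the registered sign, U+00AE:

```ruby
decimal = 174    # the number in &#174;
hex     = 0xAE   # the number in &#xAE;

# Build the character from its code point (UTF-8 encoded).
char = [decimal].pack('U')

puts decimal == hex      # both references carry the same number
puts char == "\u00AE"    # and that number is the code point of the registered sign
puts char.ord
```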

If you wanted the to_s representation of a text node to actually be &reg;, then the markup describing that would really have to be: <p>&amp;reg;</p>.
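That observation suggests a possible workaround: escape the ampersand of every entity reference before parsing, so the parser sees literal text, then undo the escaping on the serialized output. The sketch below shows only the two string steps in plain Ruby. The helper names protect_entities and restore_entities and the ENTITY pattern are hypothetical, and the pattern is a simplification that will not cover every edge case of HTML.

```ruby
# Matches a named, decimal, or hexadecimal entity reference such as
# &reg;, &#174; or &#xAE;.
ENTITY = /&(#\d+|#[xX]\h+|[A-Za-z][A-Za-z0-9]*);/

# Hypothetical helper: escape the ampersand of every entity reference so
# an HTML parser treats it as literal text (&reg; becomes &amp;reg;).
def protect_entities(html)
  html.gsub(ENTITY, '&amp;\1;')
end

# Hypothetical helper: undo one level of that escaping on serialized output.
def restore_entities(html)
  html.gsub(/&amp;(#\d+|#[xX]\h+|[A-Za-z][A-Za-z0-9]*);/, '&\1;')
end

protected_html = protect_entities('<p>&reg; and &#xAE;</p>')
puts protected_html
puts restore_entities(protected_html)
```

In real use the protected string would be handed to Nokogiri::HTML.fragment, and restore_entities would be applied to the result of to_html. This assumes the serializer writes a literal ampersand in text as &amp;, and it means the text nodes hold the string "&reg;" rather than "®" while you work with the document.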

If Nokogiri were to always return each character in the same form in which it was written in the input document, it would need to store each such character as a special node recording the entity reference. A class exists that might be used for this (Nokogiri::XML::EntityReference):

require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>

However, I can't find a way to have these nodes created during parsing with Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:

require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML,
  Nokogiri::XML::ParseOptions::DEFAULT_XML,
  Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
  p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
