&#146; is getting converted to "\u0092" by Nokogiri in Ruby on Rails

Question

I have an HTML page that contains the following line, with HTML character references such as &#146;.

# The whole page is not pasted here; only the line with the issue is shown.
require 'nokogiri'

html_file = "<html>....<body><p>they&#146;re originally intended to describe the spread of of viral diseases, but they&#146;re nice analogies for how web/SN apps grow.<p> ...</body></html>"

doc          = Nokogiri::HTML(html_file)   # parse with the HTML (libxml2) parser
body         = doc.xpath('//body')
body_content = body[0].inner_html          # serialized contents of <body>

puts body_content

Result:

These terms come from the fields of medicine and biology  they\u0092re originally intended to describe the spread of of viral diseases, but they\u0092re nice analogies for how web/SN apps grow.

I want to leave these entities as they are instead of having them converted to Unicode characters. Am I missing something?

Thanks

Recommended answer

they&#146;re

is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#x2019; or &#8217;. Or, if you're using UTF-8, just include it verbatim as ’.

&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
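The difference between the two readings can be seen in plain Ruby. This is only an illustration of the mapping, not of how any browser or parser implements it. It assumes Ruby's built-in Windows-1252 transcoder.

```ruby
# Reading 1 (XML rules): &#146; names the code point 146, i.e. U+0092.
literal = [146].pack('U')

# Reading 2 (HTML browser quirk): 146 is treated as a Windows-1252 byte,
# which that code page assigns to U+2019 RIGHT SINGLE QUOTATION MARK.
quirk = [146].pack('C').force_encoding('Windows-1252').encode('UTF-8')

puts format('U+%04X', literal.ord)
puts format('U+%04X', quirk.ord)
```

The first value is the control character that shows up as \u0092 in the question's output; the second is the curly apostrophe the page author presumably intended.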

The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be a change to libxml2's htmlParseCharRef function, so that it substitutes the cp1252 characters for references 128–159.

In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146;->&#x2019; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
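A minimal sketch of that fix-up, generalized from &#146; to the whole 128 to 159 range. The helper name `fix_cp1252_char_refs` is hypothetical and not part of Nokogiri. Handling hexadecimal references such as &#x92; is an added assumption, and so is leaving untouched any number for which Ruby's transcoder reports no cp1252 mapping.

```ruby
# Hypothetical helper (not a Nokogiri API): rewrite numeric character
# references in the range 128..159 to the Unicode code point that
# Windows-1252 assigns to that byte, e.g. &#146; -> &#x2019;.
def fix_cp1252_char_refs(html)
  html.gsub(/&#(?:[xX](\h+)|(\d+));/) do |ref|
    code = $1 ? $1.to_i(16) : $2.to_i
    next ref unless (128..159).cover?(code)  # leave other references alone

    begin
      char = [code].pack('C').force_encoding('Windows-1252').encode('UTF-8')
      format('&#x%04X;', char.ord)
    rescue EncodingError
      ref  # no conversion available for this byte; keep the original text
    end
  end
end

puts fix_cp1252_char_refs('they&#146;re nice analogies')
```

The returned string can then be passed to Nokogiri::HTML in place of the raw page, as in the question's code. As noted above, this crude substitution also rewrites matching sequences inside comments.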
