Unicode issue with an HTML Title, question mark? &#65533;
Question
I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/
When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element, I get the following:
Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010
However, when I display that in my webpage with UTF-8 encoding, it just shows a question mark.
I'm using the following code:
String title = StringEscapeUtils.escapeHtml(myTitle);
If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct.
The title:
<title>Das hermetische Café: Rock & Wrestling 2010</title>
BECOMES (which is what I was expecting the escapeHtml method to do):
<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>
Any ideas? Thanks.
Recommended answer
U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8 while the page is actually encoded with ISO-8859-1 (the default if another is not specified in the Content-Type header or equivalent).
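To illustrate the mismatch described above, here is a minimal sketch that encodes a short string as ISO-8859-1 and then decodes those bytes as UTF-8. The class and method names are made up for this example, and the sample text is a shortened stand-in for the real title; it only assumes the default behaviour of Java's `String(byte[], Charset)` constructor, which replaces malformed input rather than throwing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question's code.
class ReplacementCharDemo {

    // Encodes the text as ISO-8859-1, then decodes the bytes with the WRONG decoder (UTF-8).
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; in ISO-8859-1 it is the single byte 0xE9,
        // which is not a valid UTF-8 sequence when followed by ':'.
        String decoded = decodeLatin1BytesAsUtf8("Caf\u00e9: Rock");

        // Print the code point that ended up where "é" used to be.
        System.out.println((int) decoded.charAt(3));
    }
}
```

The character at index 3 comes back as the replacement character, so any later escaping step only ever sees U+FFFD and never the original "é".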
So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly, as &#65533;.
The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
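As a rough sketch of that advice, the helper below reads a stream into a String with an explicitly chosen charset instead of relying on a default. `PageDecoder` and `readAll` are hypothetical names, and the in-memory byte array stands in for the fetched page; how the real code obtains its stream (and how it should detect the charset from the response headers) is not shown in the question, so that part is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question's code.
class PageDecoder {

    // Reads the whole stream, decoding bytes with the charset the caller names.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the fetched page: bytes encoded as ISO-8859-1.
        byte[] pageBytes = "<title>Das hermetische Caf\u00e9</title>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Decode with the same charset the page was encoded in.
        String html = readAll(new ByteArrayInputStream(pageBytes), StandardCharsets.ISO_8859_1);

        // Reports whether the "é" survived decoding.
        System.out.println(html.contains("Caf\u00e9"));
    }
}
```

With the "é" intact in the String, the title can then be passed to escapeHtml as in the question's code.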