Unicode issue with an HTML title, question mark? &#65533;


Problem description


I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my webpage with UTF-8 encoding, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

Any ideas? Thanks.

Solution

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
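As a rough illustration of the "depending on its configuration" part, the sketch below uses the JDK's `CharsetDecoder` with two different error actions. The class and method names are invented for this example, and the sample bytes assume the title's "é" arrived as the single ISO-8859-1 byte 0xE9.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecoderConfigDemo {

    // Lenient UTF-8 decode: malformed bytes become U+FFFD and decoding continues.
    static String decodeReplacing(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; rethrown unchecked to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }

    // Strict UTF-8 check: REPORT makes the decoder throw on the first malformed sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Caf", the lone byte 0xE9, then ":".
        byte[] raw = {'C', 'a', 'f', (byte) 0xE9, ':'};
        System.out.println(isWellFormedUtf8(raw));
        System.out.println(decodeReplacing(raw).indexOf('\uFFFD'));
    }
}
```

A strict decoder surfaces the problem at the point of decoding, whereas a replacing decoder silently yields a string that only looks wrong later, for instance once it has been HTML-escaped.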

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
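The mismatch can be reproduced without any network access. The following is a minimal sketch, assuming the page body contains the byte 0xE9 for "é"; the class name and byte array are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class WrongDecoderDemo {

    // "Café:" as it would be stored in ISO-8859-1; 0xE9 is "é" in that encoding.
    static final byte[] LATIN1_BYTES = {'C', 'a', 'f', (byte) 0xE9, ':'};

    // Wrong decoder: 0xE9 followed by ':' is not a valid UTF-8 sequence.
    static String asUtf8() {
        return new String(LATIN1_BYTES, StandardCharsets.UTF_8);
    }

    // Matching decoder: every byte maps directly to one character.
    static String asLatin1() {
        return new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Numeric value of the character that replaced the bad byte.
        System.out.println((int) asUtf8().charAt(3));
        System.out.println(asLatin1());
    }
}
```

The `String(byte[], Charset)` constructor always substitutes the replacement character for malformed input, which is why no exception signals the problem.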

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
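One way to apply this is to choose the charset before building the String. The sketch below is an assumption about how the fetch code might be organized, not the asker's actual code: `TitleDecoder`, `charsetFromContentType`, and `decode` are hypothetical names, and `readAllBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TitleDecoder {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, has no charset, or names an unknown one.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream and decodes it with the given charset.
    static String decode(InputStream in, Charset charset) {
        try {
            return new String(in.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a fetched response body; 0xE9 is "é" in ISO-8859-1.
        byte[] body = {'C', 'a', 'f', (byte) 0xE9};
        Charset cs = charsetFromContentType("text/html", StandardCharsets.ISO_8859_1);
        System.out.println(decode(new ByteArrayInputStream(body), cs));
    }
}
```

With `java.net.URLConnection`, the header value would come from `getContentType()` and the stream from `getInputStream()`. A page may also declare its charset in a `<meta>` tag, which this sketch does not inspect.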
