Unicode issue with an HTML Title, question mark? 65533;

Views: 371 Posted: 2018/6/15 9:26:56 Tags: java html unicode utf-8

Question

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my UTF-8 encoded webpage, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output which seems correct

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

any ideas? thanks

Solution

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
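The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not the asker's code: the class name `ReplacementCharDemo` and the helper `decodeLatin1BytesAs` are made up for illustration. It encodes "Café:" as ISO-8859-1 bytes and then decodes those bytes twice, once with the wrong decoder (UTF-8) and once with the right one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: encode text as ISO-8859-1 bytes, then decode
    // those bytes with the given charset (which may be the wrong one).
    static String decodeLatin1BytesAs(String text, Charset decoder) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        // new String(byte[], Charset) replaces malformed input with U+FFFD
        // instead of throwing.
        return new String(raw, decoder);
    }

    public static void main(String[] args) {
        String original = "Caf\u00E9:"; // "Café:" written with an escape

        String wrong = decodeLatin1BytesAs(original, StandardCharsets.UTF_8);
        String right = decodeLatin1BytesAs(original, StandardCharsets.ISO_8859_1);

        // The lone byte 0xE9 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println((int) wrong.charAt(3)); // 65533
        // Decoded as ISO-8859-1, the same byte is U+00E9 (é).
        System.out.println((int) right.charAt(3)); // 233
    }
}
```

The value 65533 printed for the wrong decoder is the same number that shows up in the escaped title.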

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
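One way to apply this advice is to name the charset explicitly when wrapping the response stream in a reader, rather than relying on a default. The sketch below is an assumption about how the fetch might be structured, since the question does not show that code; `DecodeWithCharset` and `readAll` are hypothetical names, and a `ByteArrayInputStream` stands in for the real HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeWithCharset {
    // Hypothetical helper: read the whole stream into a String using an
    // explicitly chosen charset instead of the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the fetched page: the title encoded as ISO-8859-1 bytes.
        byte[] page = "<title>Das hermetische Caf\u00E9: Rock &amp; Wrestling 2010</title>"
                .getBytes(StandardCharsets.ISO_8859_1);

        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        // With the matching decoder, the é survives and no U+FFFD appears.
        System.out.println(html.contains("Caf\u00E9"));  // true
        System.out.println(html.indexOf('\uFFFD'));      // -1
    }
}
```

In real code the stream would come from the HTTP connection, and the charset would ideally be taken from the response's content-type header rather than hard-coded.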
