Character Encoding and the ’ Issue


Question

Even today, character encoding problems are still very common. Take for example this recent job post:

(Note: This is an example, not a spam job post... :-)

I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.

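My two-part question:

  • What causes this particular, common encoding issue?
  • As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume the content is entered through a web browser.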

Recommended answer

What causes this particular, common encoding issue?

This occurs when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to present the data in a form that is sensible to humans, it has to be converted to characters (strings). This conversion is based on a charset, of which there are many different ones.

In the particular ’ example, this is the typical result of the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019) ’ being encoded as UTF-8 and then read using CP1252. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you'll see that those bytes represent exactly the characters â, € and ™.
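The byte-level mechanics can be reproduced in a few lines. The following is a minimal sketch in Python, not taken from the original answer; the helper names `garble` and `repair` are hypothetical and exist only for illustration.

```python
# Hypothetical helpers illustrating UTF-8 bytes being misread as CP1252.

def garble(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as CP1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo garble(): re-encode as CP1252, then decode the bytes as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


quote = "\u2019"                       # RIGHT SINGLE QUOTATION MARK
utf8_hex = quote.encode("utf-8").hex()  # the three UTF-8 bytes, as hex digits
garbled = garble(quote)                 # three characters: U+00E2, U+20AC, U+2122
restored = repair(garbled)              # back to the single character U+2019
```

Here `utf8_hex` is `e28099`, and `garbled` is the three-character sequence â, € and ™. Note that `repair` is only a diagnostic aid under the assumption that no bytes were lost along the way. Python's `cp1252` codec leaves a few byte values (such as 0x9D) undefined, so `garble` raises `UnicodeDecodeError` for characters whose UTF-8 encoding contains one of them, for example U+201D.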

This can be caused by the website not having read in the original source properly (it should have used UTF-8 for this), or by it serving a UTF-8 page with the wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing; on Windows machines the default charset of CP1252 would then be used).

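As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume the content is entered through a web browser.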

Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then, ensure that you consistently store, write and send them using a Unicode charset, preferably UTF-8.
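As a rough illustration of that rule, the sketch below decodes a binary stream with an explicitly named charset and then emits UTF-8. It is written in Python rather than Java, and the function names `read_text` and `to_utf8` are hypothetical. The source charset is assumed to be known in advance (here CP1252, where byte 0x92 is the right single quotation mark).

```python
import io


def read_text(stream: io.BufferedIOBase, charset: str) -> str:
    """Decode a binary stream using an explicitly named charset.

    errors="strict" makes a wrong charset fail loudly instead of
    silently producing replacement characters.
    """
    reader = io.TextIOWrapper(stream, encoding=charset, errors="strict", newline="")
    return reader.read()


def to_utf8(text: str) -> bytes:
    """Encode text as UTF-8 for storing, writing or sending."""
    return text.encode("utf-8")


# Stand-in for a file, URL or socket whose charset is known to be CP1252.
source = io.BytesIO(b"It\x92s fine")
text = read_text(source, "cp1252")
payload = to_utf8(text)
```

The same idea applies to files: passing `encoding=` to `open()` explicitly avoids depending on the platform default. For content served to a browser, the matching step is to declare the charset actually used, for example `Content-Type: text/html; charset=UTF-8`.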

If you're familiar with Java (your question history confirms this), you may find this article useful.
