Character Encoding and the ’ Issue


Problem Description


Even today, one still sees character encoding problems with significant frequency. Take for example this recent job post:

(Note: This is an example, not a spam job post... :-)

I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.

My two-part question:

  • What causes this particular, common encoding issue?
  • As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.

Solution

What causes this particular, common encoding issue?

This occurs when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion is based on a charset, of which there are many.

In this particular ’ example, the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019) was encoded as UTF-8 and the resulting bytes were then decoded as CP1252. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you'll see that those bytes represent exactly the characters â, € and ™.
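The byte-level explanation above can be reproduced in a few lines. This is a minimal sketch in Python (the language is chosen only for illustration, and the variable names are made up for this example); it relies on nothing beyond the standard `utf-8` and `cp1252` codecs.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, written as an escape so the
# example does not depend on the encoding of this source file.
quote = "\u2019"

# Characters -> bytes using UTF-8.
utf8_bytes = quote.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e2 80 99

# Bytes -> characters using the WRONG charset (CP1252).
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))  # '\xe2\u20ac\u2122'  (the characters â, € and ™)

# Undoing the mistake: re-encode with CP1252, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == quote)  # True
```

The round trip in the last step only works while the garbled text has not been altered or re-encoded again, so it is better treated as a diagnostic than as a fix.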

This can be caused by the website not having read the original source properly (the source was UTF-8, but it was read as CP1252), or by it serving a UTF-8 page with a wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing, in which case the client falls back to a default, which on Windows machines would be CP1252).
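To make the header scenario concrete, here is a rough sketch of a client that decodes a response body according to the charset declared in Content-Type. The helper names `declared_charset` and `decode_body` are hypothetical, and the CP1252 fallback is an assumption that models the Windows default mentioned above; real browsers apply more elaborate detection rules.

```python
from email.message import Message


def declared_charset(content_type: str, fallback: str = "cp1252") -> str:
    """Return the charset parameter of a Content-Type value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=fallback)


def decode_body(body: bytes, content_type: str) -> str:
    """Decode body bytes the way a client trusting the header would."""
    return body.decode(declared_charset(content_type))


# The server actually produced UTF-8 bytes for the text "It’s".
body = "It\u2019s".encode("utf-8")

for header in (
    "text/html; charset=UTF-8",   # correct declaration
    "text/html; charset=CP1252",  # wrong declaration
    "text/html",                  # missing declaration, fallback applies
):
    print(header, "->", ascii(decode_body(body, header)))
```

Only the first header yields the original text; the other two produce the â€™ sequence, because the same bytes are interpreted with the wrong charset.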


As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.

Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then, ensure that you're consistently storing, writing and sending it using a Unicode charset, preferably UTF-8.
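As an illustration of this advice, the sketch below reads a legacy file with an explicitly named charset and writes it back out as UTF-8. The file names and the helpers `read_text` and `write_utf8` are invented for the example, and it assumes the source charset is already known to be CP1252; it does not try to detect the charset.

```python
import tempfile
from pathlib import Path


def read_text(path: Path, source_charset: str) -> str:
    """Bytes -> characters, always with an explicitly named charset."""
    return path.read_bytes().decode(source_charset)


def write_utf8(path: Path, text: str) -> None:
    """Characters -> bytes, always as UTF-8."""
    path.write_bytes(text.encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # A legacy file in CP1252, where byte 0x92 is the right single quote.
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes(b"It\x92s fine")

    text = read_text(legacy, "cp1252")

    converted = Path(tmp) / "converted.txt"
    write_utf8(converted, text)
    stored = converted.read_bytes()

print(ascii(text))  # 'It\u2019s fine'
print(stored)       # b'It\xe2\x80\x99s fine'
```

The point of the two small helpers is that no call ever relies on a platform default: the charset is named at every boundary between bytes and characters.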

If you're familiar with Java (your question history confirms this), you may find this article useful.
