What Character Encoding is best for multinational companies

Question

If you had a website that was to be translated into every language in the world, and therefore had a database with all these translations, what character encoding would be best? UTF-128?

If so, do all browsers understand the chosen encoding? Is character encoding straightforward to implement, or are there hidden factors?

Thanks.

Answer

If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:

Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]

Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]

UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.

The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.

For characters in the ASCII range, UTF-8 is more compact (1 byte vs 2) than UTF-16. For characters between the ASCII range and U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there.
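
As a quick check on those per-range numbers, here is a small Python sketch (the sample strings are arbitrary illustrations, not from the original answer) that compares encoded sizes directly:

# Compare UTF-8 vs UTF-16 byte counts for a few script ranges.
# utf-16-be is used so that no BOM inflates the counts.
samples = {
    "ASCII (U+0000..U+007F)": "hello",
    "Cyrillic (U+0080..U+07FF)": "привет",
    "Japanese (U+0800..U+FFFF)": "こんにちは",
    "Emoji (outside the BMP)": "😀😀",
}
for name, text in samples.items():
    print(f"{name}: UTF-8 {len(text.encode('utf-8'))} bytes, "
          f"UTF-16 {len(text.encode('utf-16-be'))} bytes")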

The only range in which UTF-16 is more efficient than UTF-8 is U+0800 to U+FFFF (where UTF-8 needs three bytes to UTF-16's two), which includes the Indic scripts and CJK. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.

For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:


$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r--  1 lambda  lambda  32416 Mar 13 13:06 nhk.16.html
-rw-r--r--  1 lambda  lambda  18337 Mar 13 13:04 nhk.html

UTF-8 is better in almost every way than UTF-16. Both of them are variable-width encodings, and so have the complexity that entails. In UTF-16, however, 4-byte characters are fairly uncommon, so it's a lot easier to make fixed-width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by encoding each half of a surrogate pair as if it were a separate code point (using six bytes per character: three bytes for each half of the surrogate pair), instead of decoding the pair to its code point and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized, so that at least broken programs can be made to interoperate.
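
To make the CESU-8 mistake concrete, here is a minimal Python sketch (the emoji is an arbitrary example of a character outside the BMP): it builds the broken six-byte sequence by encoding each UTF-16 surrogate half on its own, next to the correct four-byte UTF-8 form:

# U+1F600 is outside the BMP, so UTF-16 stores it as a surrogate pair.
ch = "\U0001F600"
print(ch.encode("utf-8"))        # correct UTF-8: b'\xf0\x9f\x98\x80' (4 bytes)

units = ch.encode("utf-16-be")   # high surrogate D83D, low surrogate DE00
# The CESU-8 mistake: encode each surrogate half as if it were a code point.
# Python only allows encoding lone surrogates with the 'surrogatepass' handler.
cesu8 = b"".join(
    chr(int.from_bytes(units[i:i + 2], "big")).encode("utf-8", "surrogatepass")
    for i in (0, 2)
)
print(cesu8)                     # b'\xed\xa0\xbd\xed\xb8\x80' (6 bytes)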

UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than just picking a different encoding. UTF-8 is compatible with APIs and data structures that use null-terminated byte sequences to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (as most C and POSIX string-handling APIs can), UTF-8 works fine without a whole new set of APIs and data structures for wide characters. UTF-16, by contrast, makes you deal with endianness: there are actually three related encodings, UTF-16, UTF-16BE, and UTF-16LE. Plain UTF-16 can be either big- or little-endian, so it requires a BOM to say which. UTF-16BE and UTF-16LE are the big- and little-endian versions with no BOM, so you need an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, and out-of-band headers are notorious for being wrong or missing.
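
Both problems are easy to observe in a short Python sketch (plain standard-library calls, nothing specific to the original answer): plain UTF-16 prepends a BOM, the BE/LE variants rely on out-of-band signalling, and UTF-16 output is full of NUL bytes that break null-terminated string APIs:

text = "abc"
print(text.encode("utf-16"))     # BOM first, then native byte order:
                                 # b'\xff\xfea\x00b\x00c\x00' on a little-endian machine
print(text.encode("utf-16-be"))  # b'\x00a\x00b\x00c': no BOM, order signalled out-of-band
print(text.encode("utf-16-le"))  # b'a\x00b\x00c\x00': no BOM, order signalled out-of-band

# UTF-16 embeds NUL bytes in ordinary text; UTF-8 never produces a zero
# byte except for U+0000 itself, so C-style string APIs keep working.
print(b"\x00" in text.encode("utf-16-le"))   # True
print(b"\x00" in text.encode("utf-8"))       # False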

UTF-16 is basically a historical accident: people initially thought 16 bits would be enough to encode all of Unicode, and so changed their representations and APIs to use wide (16-bit) characters. When they realized they would need more characters, they came up with a scheme that uses reserved code points (the surrogates) to encode values beyond U+FFFF as two code units, so they could keep the same data structures for the new encoding. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.
