How many bytes do we need to store an Arabic character?

Question

I'm a little confused about how much storage is needed to represent an Arabic character.

Please let me know if this is true:

  • It takes 2 bytes in the ISO/IEC 8859-6 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-6)
  • It takes 4 bytes in Unicode (http://en.wikipedia.org/wiki/Arabic_Unicode)

What are the advantages of each encoding? When should we prefer one over the other?

Recommended answer

Well, first: Unicode is not an encoding. It is a standard that assigns a code point to each character across the world's writing systems. These code points are integers; how many bytes they take up depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.
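
To make the distinction between a code point and its encoded bytes concrete, here is a minimal Python sketch. The variable names are illustrative, and UTF-32 is included only for comparison (the answer itself does not discuss it); it is the fixed-width encoding in which every code point takes 4 bytes, which may be where the "4 bytes" in the question comes from.

```python
# A code point is just an integer; the byte count depends on the encoding.
ch = "\u062d"  # the Arabic letter used as an example in this answer

code_point = ord(ch)  # the integer that Unicode assigns to this character

# Explicit "-le" codecs are used so that no byte order mark is prepended.
sizes = {
    enc: len(ch.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(f"U+{code_point:04X}", sizes)
```

The same integer, 0x062D, ends up as 2, 2, or 4 bytes depending purely on which encoding is chosen.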

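In summary:

  • ISO 8859-6 uses 1 byte for each Arabic character, but it does not support the "Arabic Presentation Forms", nor characters from any other script apart from ASCII.
  • UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for the "Arabic Presentation Forms".
  • UTF-16 uses 2 bytes for each Arabic character, including the "Arabic Presentation Forms".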

I will use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are hexadecimal codes representing the Unicode code point of each of those characters.

In ISO 8859-6, each Arabic character that the encoding supports takes up just a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see from the table in the Wikipedia article. The character 'ﻰ' (U+FEF0) is listed as an "Arabic Presentation Form", so I suppose that explains why it doesn't appear in ISO 8859-6 at all (you can't encode this character in that encoding).
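
The following short Python sketch checks both claims using Python's built-in `iso8859_6` codec. The variable names are illustrative only.

```python
hah = "\u062d"                # basic Arabic letter, U+062D
presentation_form = "\ufef0"  # Arabic Presentation Form, U+FEF0

# The basic letter maps to exactly one byte in ISO 8859-6.
encoded = hah.encode("iso8859_6")
print(encoded.hex())

# The presentation form has no slot in ISO 8859-6, so encoding fails.
try:
    presentation_form.encode("iso8859_6")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print("presentation form encodable:", encodable)
```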

There are two very common Unicode encodings which let you encode all characters: UTF-8 and UTF-16. They have slightly different uses. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other characters in the Basic Multilingual Plane (which includes the main Arabic blocks), and 4 bytes for everything else. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane, and 4 bytes for everything else. So basically, if your text contains lots of ASCII, UTF-8 is more compact. For text dominated by characters that need 3 bytes in UTF-8, UTF-16 is more compact.
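
As a rough illustration of that trade-off, the sketch below measures a few sample strings in both encodings. The samples (an English word, an Arabic word, three CJK characters, and one emoji) and the helper `encoded_sizes` are my own choices for illustration and are not part of the original answer.

```python
samples = {
    "ascii": "hello",                     # 5 ASCII characters
    "arabic": "\u0633\u0644\u0627\u0645",  # 4 basic Arabic letters
    "cjk": "\u65e5\u672c\u8a9e",           # 3 characters at or above U+0800
    "emoji": "\U0001F600",                # 1 character outside the BMP
}

def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the Arabic sample the two encodings come out the same size, so for plain Arabic text the choice between them is not really about space.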

In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Basically, characters between U+0080 and U+07FF use 2 bytes, and characters between U+0800 and U+FFFF use 3 bytes. So all the basic Arabic and Arabic Supplement characters use 2 bytes, whereas the Arabic Presentation Forms use 3 bytes.
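
To show where those ranges come from, here is a hand-rolled UTF-8 encoder for a single code point. It is a teaching sketch only: the function name `utf8_encode` is hypothetical, it does not reject surrogates or out-of-range values, and in real code you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (no validation; illustration only)."""
    if code_point < 0x80:      # U+0000..U+007F: 1 byte, 7 payload bits
        return bytes([code_point])
    if code_point < 0x800:     # U+0080..U+07FF: 2 bytes, 5 + 6 payload bits
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:   # U+0800..U+FFFF: 3 bytes, 4 + 6 + 6 payload bits
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([             # U+10000 and above: 4 bytes
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

print(utf8_encode(0x062D).hex(" "))
print(utf8_encode(0xFEF0).hex(" "))
```

The 2-byte form only has room for 11 payload bits, which is why the boundary sits between U+07FF and U+0800.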

In UTF-16, 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as the 2-byte sequence "F0 FE". In UTF-16, Arabic characters in the Basic Multilingual Plane, which covers all the blocks mentioned here, are two bytes. Endianness complicates this slightly. The sequences above are little-endian (UTF-16LE): they are just the code point with its two bytes swapped around. The big-endian form (UTF-16BE) is equally valid: "06 2D" for the first one, and "FE F0" for the second.
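
The byte-order difference can be seen directly with Python's built-in codecs. This is a small sketch with illustrative variable names; note that the plain `utf-16` codec prepends a byte order mark (BOM) and uses the machine's native byte order, so its output differs between platforms.

```python
ch = "\u062d"

le = ch.encode("utf-16-le")    # little-endian, no BOM
be = ch.encode("utf-16-be")    # big-endian, no BOM
with_bom = ch.encode("utf-16")  # BOM (2 bytes) + native byte order

print(le.hex(" "))
print(be.hex(" "))
print(with_bom.hex(" "))

# The BOM lets a decoder work out which byte order was used.
assert with_bom.decode("utf-16") == ch
```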

In summary, I would usually recommend UTF-8, as it has no byte-order ambiguity and supports ASCII text very well. Arabic characters are 2 bytes in either UTF-8 or UTF-16 (unless you use "presentation forms", which take 3 bytes in UTF-8). You can use ISO 8859-6 if you are only using ASCII and Arabic characters and nothing else, and that will save you some space, but it usually isn't worth it, as it will break as soon as some other character comes along. UTF-8 and UTF-16 support all characters in Unicode.
