What is Unicode, UTF-8, UTF-16?

Question

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.

In VSS, when doing a file comparison, there is sometimes a message saying the two files have differing UTFs. Why would this be the case?

Please explain in simple terms.

Recommended answer

Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people. On top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence the first 128 are also identical to ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
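
To make code points a little more concrete, here is a minimal Python sketch. It is not part of the original answer: the sample characters and the helper name `in_bmp` are chosen purely for illustration. It uses only the built-ins `ord`, `chr` and `str.encode`.

```python
# Sample characters (illustrative choice): an ASCII letter, a Latin-1 letter,
# a CJK character, and an emoji that lies outside the BMP.
samples = ["A", "\u00e9", "汉", "\U0001F600"]


def in_bmp(ch):
    """Return True if the character's code point lies in the BMP (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF


for ch in samples:
    # ord() gives the Unicode code point of a single character.
    print(f"{ch!r}: U+{ord(ch):04X}, in BMP: {in_bmp(ch)}")

# The first 256 code points coincide with the ISO-8859-1 (Latin-1) byte values,
# and the first 128 of those coincide with ASCII.
latin1_matches = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
ascii_matches = all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
print("Latin-1 matches:", latin1_matches, "ASCII matches:", ascii_matches)
```

Note that a code point is just a number. How that number is turned into bytes is the job of an encoding such as UTF-8 or UTF-16.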

So how many bytes give access to what characters in these encodings?
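
One way to explore this yourself is to encode a few characters and count the bytes. The following is a rough Python sketch rather than anything from the original answer: the sample characters and the helper name `byte_lengths` are assumptions made for illustration. It uses the `utf-16-le` codec so that no byte order mark is added to the count.

```python
# Illustrative samples: code point ranges that behave differently in UTF-8 and UTF-16.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter (e with acute)",
    "汉": "CJK character inside the BMP",
    "\U0001F600": "emoji outside the BMP",
}


def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for a single character."""
    # "utf-16-le" fixes the byte order, so Python does not prepend a BOM.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, description in samples.items():
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X} ({description}): "
          f"UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

With these samples, UTF-8 uses 1, 2, 3 and 4 bytes respectively, while UTF-16 uses 2 bytes for the three BMP characters and 4 bytes (a surrogate pair) for the emoji.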
