Why are there duplicate characters in Unicode?

Question

I can see some duplicate characters in Unicode. For example, the character 'C' can be represented by the code points U+0043 and U+0421. Why is this so?

Recommended answer

As others have noted, your main fallacy here is confusing the Latin and Cyrillic scripts and some glyphs therein (namely C (U+0043 LATIN CAPITAL LETTER C) and С (U+0421 CYRILLIC CAPITAL LETTER ES)). There are many such character pairs that look alike but are different characters. You will find plenty among Latin, Greek and Cyrillic, for example. Most of the time the resemblance holds only in either uppercase or lowercase, though.
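
To see that these two are distinct characters rather than one character stored twice, one can ask the Unicode character database for their names. This is a minimal sketch using Python's standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

latin_c = "\u0043"      # looks like C
cyrillic_es = "\u0421"  # also looks like C in most fonts

# Print each code point together with its official Unicode name.
for ch in (latin_c, cyrillic_es):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The two strings compare unequal even though they render alike.
print(latin_c == cyrillic_es)
```

The names place one letter in the Latin script and the other in the Cyrillic script, which is why a plain string comparison treats them as different.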

However, there are in fact duplicates, sometimes intentionally so. For example, the entire (ASCII) Latin alphabet is represented again, in fullwidth form, in the 'Halfwidth and Fullwidth Forms' Unicode block between U+FF00 and U+FFEF. There are other such examples, most notably the Mathematical Alphanumeric Symbols block on Plane 1, which holds more than a dozen further styled copies of the Latin alphabet (bold, italic, script, Fraktur, double-struck and so on).
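
These intentional duplicates carry compatibility mappings back to the plain letter, so compatibility normalization (NFKC) folds them together. The following is a small sketch, assuming Python's standard `unicodedata` module; the three sample characters are chosen only for illustration.

```python
import unicodedata

# Three "extra" capital A's that exist alongside U+0041.
samples = {
    "fullwidth": "\uFF21",         # Halfwidth and Fullwidth Forms block
    "math bold": "\U0001D400",     # Mathematical Alphanumeric Symbols, Plane 1
    "math fraktur": "\U0001D504",  # Mathematical Alphanumeric Symbols, Plane 1
}

# NFKC applies compatibility decompositions, then recomposes.
folded = {label: unicodedata.normalize("NFKC", ch) for label, ch in samples.items()}

for label, ch in samples.items():
    print(f"{label:13} U+{ord(ch):04X} -> U+{ord(folded[label]):04X}")
```

Note that NFKC discards the distinction on purpose, so it suits searching and comparison better than storage of text where the styling carries meaning.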

There are other things that are in fact the same character but at different code points. For example, there is µ (U+00B5 MICRO SIGN) and μ (U+03BC GREEK SMALL LETTER MU). Those are usually linked by a decomposition mapping: U+00B5 has a compatibility decomposition to U+03BC.
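
The link between the micro sign and the Greek mu can be inspected directly. This sketch uses only Python's standard `unicodedata` module and shows that canonical normalization (NFC) keeps the two apart, while compatibility normalization (NFKC) unifies them.

```python
import unicodedata

micro = "\u00B5"  # MICRO SIGN
mu = "\u03BC"     # GREEK SMALL LETTER MU

# The decomposition field of the character database records the link.
mapping = unicodedata.decomposition(micro)
print(mapping)

# NFC leaves the micro sign alone; NFKC replaces it with the Greek letter.
nfc = unicodedata.normalize("NFC", micro)
nfkc = unicodedata.normalize("NFKC", micro)
print(nfc == micro, nfkc == mu)
```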

Unicode deals with an abstract concept called a code point. The code point unambiguously defines a character and its script or group. It says nothing about how the corresponding glyph in a font would be rendered (which may vary wildly for Latin already). It also does not define how this code point is represented in a file or memory (i.e. as a byte sequence). That's a job for one of the Unicode Transformation Formats (UTF-8, UTF-16, UTF-32).
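
The separation between a code point and its byte representation can be illustrated with a short sketch. It takes the Cyrillic letter from the question and encodes the same single code point with three transformation formats, using Python's built-in codecs.

```python
ch = "\u0421"  # CYRILLIC CAPITAL LETTER ES, one code point

# The code point is an abstract number, independent of any byte layout.
print(f"code point: U+{ord(ch):04X}")

# Each transformation format serializes that number differently.
encodings = ("utf-8", "utf-16-be", "utf-32-be")
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
```

All three byte sequences decode back to the same code point; only the storage format differs.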

What is the reason to have a similar-looking character in two languages with different code points?

The main points of Unicode here are:

  • Compatibility with every previously existing character encoding. This requires a one-to-one mapping from every character used in such an encoding to a direct equivalent Unicode code point.
  • Faithful and accurate representation of every script in use today, later expanded to other scripts that were in use and need to be stored in computer systems.
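
The compatibility goal can be made concrete with a round-trip check. The sketch below assumes two legacy encodings that ship with Python's codecs, KOI8-R and Shift_JIS, picked only as examples: each already distinguished a pair of look-alike letters, so Unicode needs two code points per pair to convert the text there and back without loss.

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives encode/decode in the given legacy encoding."""
    return text.encode(encoding).decode(encoding) == text

# KOI8-R holds both Latin C and Cyrillic Es, as different bytes.
latin_c, cyrillic_es = "\u0043", "\u0421"
koi8_bytes = (latin_c.encode("koi8_r"), cyrillic_es.encode("koi8_r"))

# Shift_JIS holds both the ASCII A and a separate fullwidth A.
ascii_a, fullwidth_a = "\u0041", "\uFF21"
sjis_bytes = (ascii_a.encode("shift_jis"), fullwidth_a.encode("shift_jis"))

print(koi8_bytes[0] != koi8_bytes[1], round_trips(latin_c + cyrillic_es, "koi8_r"))
print(sjis_bytes[0] != sjis_bytes[1], round_trips(ascii_a + fullwidth_a, "shift_jis"))
```

If Unicode had merged each pair into a single code point, converting such text to Unicode and back could not restore the original bytes.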

So there is a very strong incentive to keep scripts separate and not try to map characters according to their appearance. Appearance can be tricky anyway. Take for example the Cyrillic letter 'т', which appears like a smaller upper-case Latin 'T' here. However, when italicized it is usually rendered so that 'т' looks like a lower-case Latin 'm'. You really don't want to map such characters by appearance.
