What's the difference between a character, a code point, a glyph and a grapheme?

Question

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes is causing me trouble. In the simplest case, when dealing with English text using ASCII characters, these concepts all have a one-to-one relationship with each other.

Seeing how these terms get used in documents like Mathias Bynens' "JavaScript has a Unicode problem" or Wikipedia's article on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...

...

Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...

...

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

...

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly does each of these concepts differ from the others, and in what circumstances would they not have a one-to-one relationship with each other?

Recommended answer

  • Character is an overloaded term that can mean many things.

  • A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

  • A code unit is the unit of storage for part of an encoded code point. In UTF-8 a code unit is 8 bits; in UTF-16 it is 16 bits. A single code unit may represent a full code point or only part of one. For example, the snowman glyph (☃) is a single code point but takes 3 UTF-8 code units and 1 UTF-16 code unit.

  • A grapheme is a sequence of one or more code points that are displayed as a single graphical unit that a reader recognizes as a single element of the writing system. (The Unicode standard's technical approximation of this is the extended grapheme cluster.) For example, both a and ä are graphemes, but they may consist of multiple code points: ä may be two code points, one for the base character a followed by one for the combining diaeresis, but there is also an alternative, legacy, precomposed single code point representing this grapheme. Some code points are never part of any grapheme in this reader-visible sense (e.g. the zero-width non-joiner, or directional overrides).

  • A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation. For example, even if the above ä is a single code point, a font may choose to render it as two separate, spatially overlaid glyphs. For OpenType fonts, the font's GSUB and GPOS tables contain the substitution and positioning information that makes this work. A font may also contain multiple alternative glyphs for the same grapheme.
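
To make the first three distinctions concrete, here is a minimal Python sketch using only the standard library. It counts code points and code units for the snowman and for ä, then groups code points into rough graphemes. The helper names (`code_points`, `utf8_code_units`, `utf16_code_units`, `naive_graphemes`) are made up for this illustration, and the grouping rule is a deliberate simplification: it only attaches combining marks to the preceding base code point and does not implement the full Unicode segmentation rules. The emoji U+1F600 is not from the answer above; it is added as an example of a code point that needs more than one UTF-16 code unit. Glyphs are left out because they live in fonts, not in the text itself.

```python
import unicodedata


def code_points(text: str) -> list:
    """Return each code point of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf8_code_units(text: str) -> int:
    """Count the 8-bit code units in the UTF-8 encoding of `text`."""
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units in the UTF-16 encoding of `text`."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(text.encode("utf-16-le")) // 2


def naive_graphemes(text: str) -> list:
    """Group each code point with the combining marks that follow it.

    Assumption: this is a rough approximation for illustration only.
    It ignores emoji sequences, Hangul jamo, control characters and
    the other cases a real grapheme segmenter has to handle.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


snowman = "\u2603"       # one code point
precomposed = "\u00E4"   # ä as a single legacy code point
decomposed = "a\u0308"   # ä as base letter + combining diaeresis
emoji = "\U0001F600"     # a code point outside the Basic Multilingual Plane

for label, text in [("snowman", snowman), ("precomposed", precomposed),
                    ("decomposed", decomposed), ("emoji", emoji)]:
    print(label, code_points(text),
          "UTF-8 units:", utf8_code_units(text),
          "UTF-16 units:", utf16_code_units(text),
          "graphemes:", len(naive_graphemes(text)))

# The two spellings of ä are different code point sequences, but
# normalization maps one onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

In Python, `len()` on a `str` counts code points, so `len(decomposed)` is 2 even though a reader sees one grapheme. Languages whose strings are sequences of UTF-16 code units (JavaScript, Java) would instead report 2 for the emoji.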
