Is UTF-8 an encoding or a character set?


Question

I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.

For example,

<meta charset="UTF-8">

vs

<?xml version="1.0" encoding="UTF-8" ?>

Solution

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding, and that is the term used in the RFC that defines it, which is quoted below.


I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that covered only the characters in that alphabet. Each such encoding came bundled with its own small character set, so the terms encoding and charset were often conflated, but they mean different things.
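To make the pre-Unicode situation concrete, here is a minimal Python sketch. The choice of ISO-8859-5 (Cyrillic), ISO-8859-7 (Greek), and Latin-1 is mine for illustration; the answer does not name specific legacy encodings. It shows one byte value decoding to three different characters, and a Greek-only encoding refusing a Cyrillic letter.

```python
# One byte value interpreted under three legacy single-byte encodings.
raw = bytes([0xE1])

as_cyrillic = raw.decode("iso8859_5")  # Cyrillic small letter es
as_greek = raw.decode("iso8859_7")     # Greek small letter alpha
as_latin = raw.decode("latin_1")       # Latin small letter a with acute

print(as_cyrillic, as_greek, as_latin)

# A legacy encoding covers only its own alphabet: the Greek encoding
# has no byte for a Cyrillic letter, so encoding fails.
try:
    "я".encode("iso8859_7")
    greek_can_encode_cyrillic = True
except UnicodeEncodeError:
    greek_can_encode_cyrillic = False

print(greek_can_encode_cyrillic)
```

Because each of these encodings implied its own repertoire of characters, naming the encoding also named the character set, which is likely how the two words came to be used interchangeably.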

Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet, a kind of character set where characters correspond directly to sounds in a spoken language.


A character set is a mapping from integers (code points) to characters, symbols, or other marks in a written language. Unicode is a character set whose code points run from U+0000 to U+10FFFF, so each one fits in a 21-bit integer. The Unicode Consortium's glossary describes it thus:
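As a rough illustration of the character-set side on its own, with no bytes involved, Python's built-in `ord` and `chr` convert between a character and its Unicode code point. The sample characters are arbitrary choices of mine.

```python
import unicodedata

# Character -> code point (an integer), independent of any byte encoding.
samples = ["A", "я", "€"]
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # Show the conventional U+XXXX notation and the character's official name.
    print(f"U+{cp:04X}", unicodedata.name(ch))

# Code point -> character is the inverse mapping.
euro = chr(0x20AC)

# The largest Unicode code point needs 21 bits.
max_code_point = 0x10FFFF
bits_needed = max_code_point.bit_length()
print(bits_needed)
```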

Unicode

  1. The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
  2. A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.


An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps between strings of code points (21-bit integers) and strings of bytes (8-bit integers): encoding turns code points into bytes, and decoding turns bytes back into code points. The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629:
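A short Python sketch of that mapping, using a sample string of my own choosing: the same code points become byte strings of different lengths under UTF-8, and a different encoding of the same character set (UTF-16 here, picked only for contrast) produces different bytes.

```python
text = "Aя€😀"  # four code points: U+0041, U+044F, U+20AC, U+1F600

# Encoding: code points -> bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))

# UTF-8 is variable-width: 1 to 4 bytes per code point.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
print(bytes_per_char)

# Decoding: bytes -> code points. The round trip is lossless.
round_trip = utf8_bytes.decode("utf-8")

# Same character set (Unicode), different encoding, different bytes.
utf16_bytes = text.encode("utf-16-be")
same_bytes = utf8_bytes == utf16_bytes
print(same_bytes)
```

The character set stays fixed across both calls to `encode`; only the byte representation changes, which is the distinction the answer draws.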

The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
