Is UTF-8 an encoding or a character set?


Problem description


I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.

For example,

<meta charset="UTF-8">

vs

<?xml version="1.0" encoding="UTF-8" ?>

Solution

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding, and that is the term used in the RFC that defines it, which is quoted below.


I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that could only represent characters in that alphabet. Each encoding was effectively tied to one character set, so the terms "encoding" and "charset" were often conflated, but they mean different things.

Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet, a kind of character set where characters correspond directly to sounds in a spoken language.
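To illustrate the pre-Unicode situation described above, here is a minimal Python sketch. The choice of byte (0xE1) and of the three legacy encodings (ISO 8859-5 for Cyrillic, ISO 8859-7 for Greek, Latin-1) is an assumption made for illustration and does not come from the original answer. It shows that the same byte stands for a different character under each encoding, which is why encoding and character set were so easy to conflate.

```python
# The byte value is an illustrative choice, not taken from the answer.
raw = bytes([0xE1])

# Python codec names for three legacy single-byte encodings.
codecs_to_try = ("iso8859_5", "iso8859_7", "latin_1")

# Decode the same byte under each encoding.
results = {codec: raw.decode(codec) for codec in codecs_to_try}

for codec, char in results.items():
    print(f"{codec}: byte 0xE1 -> {char!r} (U+{ord(char):04X})")
```

Each of these encodings can represent at most 256 characters, so text mixing Cyrillic and Greek could not be stored in any one of them.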


A character set is a mapping from integers (code points) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set whose code points fit in 21 bits (U+0000 through U+10FFFF). The Unicode Consortium's glossary describes it thus:

Unicode

  1. The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
  2. A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
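The character-set side of this can be sketched in Python, whose built-in `ord` and `chr` convert between a character and its Unicode code point. The sample characters are arbitrary choices for illustration. Note that no bytes are involved at this level: a code point is just an integer.

```python
import sys

# Arbitrary sample characters: Latin, Cyrillic, a currency sign, an emoji.
sample = "Aя€😀"

# Map each character to its code point (an integer).
code_points = [ord(ch) for ch in sample]

for ch, cp in zip(sample, code_points):
    print(f"{ch!r} -> U+{cp:04X} ({cp.bit_length()} bits)")

# The largest code point Python accepts is U+10FFFF, which needs 21 bits.
max_bits = sys.maxunicode.bit_length()
print(f"max code point U+{sys.maxunicode:X} needs {max_bits} bits")

# chr() is the inverse mapping, from code point back to character.
round_trip = "".join(chr(cp) for cp in code_points)
```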


An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps between strings of code points (21-bit integers) and strings of bytes (8-bit integers). The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629, which says:

The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
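As a rough illustration of the encoding side, the following Python sketch uses the standard `str.encode` and `bytes.decode` methods to move between code points and UTF-8 bytes. The sample string is an arbitrary choice for illustration. It shows that UTF-8 uses between one and four bytes per code point, and that ASCII characters keep their single-byte form, which is the compatibility property the RFC passage alludes to.

```python
# Arbitrary sample: one character from each UTF-8 length class (1 to 4 bytes).
text = "Aя€😀"

# Encode: string of code points -> string of bytes.
encoded = text.encode("utf-8")

# Per-character view of the byte sequences.
per_char = {ch: ch.encode("utf-8") for ch in text}
for ch, bs in per_char.items():
    print(f"U+{ord(ch):04X} {ch!r} -> {bs.hex(' ')} ({len(bs)} byte(s))")

# Decode: string of bytes -> string of code points.
decoded = encoded.decode("utf-8")
print(decoded == text)
```

The same string could equally be serialized with `text.encode("utf-16")` or `text.encode("utf-32")`: the character set (Unicode) stays the same and only the encoding changes.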
