What are surrogate characters in UTF-8?

Question

I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regex function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned that these bytes are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.

Recommended answer

What are surrogate characters in UTF-8?

This is almost like a trick question.

Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).

Approximate answer #2: Invalid (if not paired).

Approximate answer #3: It's not UTF-8; it's Modified UTF-8.

Synopsis: The term doesn't apply to UTF-8.

Unicode codepoints range from U+0000 to U+10FFFF, which needs 21 bits of data.

UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range (D800-DBFF), the second from a "low" range (DC00-DFFF). Unicode reserves the codepoints that match those two ranges, so they are never assigned to characters. They are sometimes called surrogates, but they are not characters. They don't mean anything by themselves.
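
The pairing arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the helper name to_surrogate_pair is made up for this example, and U+1F600 is just a convenient codepoint above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a codepoint in U+10000..U+10FFFF into two UTF-16 code units.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high range
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low range
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print("\U0001F600".encode("utf-16-be").hex())
```

Neither half of the pair identifies a character on its own; only the two together decode back to the original codepoint.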

UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.

#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units are exactly the ones that UTF-8 encodes with four 8-bit code units, and vice versa.
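
One way to see this correspondence is to count code units on both sides of the U+FFFF boundary. The sketch below uses only Python's built-in codecs; the helper name code_unit_counts and the sample codepoints are chosen for illustration.

```python
def code_unit_counts(ch):
    """Return (UTF-16 code units, UTF-8 code units) for a one-codepoint string."""
    utf16_units = len(ch.encode("utf-16-be")) // 2   # 2 bytes per code unit
    utf8_units = len(ch.encode("utf-8"))             # 1 byte per code unit
    return utf16_units, utf8_units


samples = ["A", "\u00e9", "\u20ac", "\uffff", "\U00010000", "\U0001F600"]
for ch in samples:
    utf16_units, utf8_units = code_unit_counts(ch)
    print(f"U+{ord(ch):04X}: UTF-16 units={utf16_units}, UTF-8 units={utf8_units}")
```

Codepoints up to U+FFFF take one UTF-16 code unit and one to three UTF-8 bytes; from U+10000 upward, both encodings switch at the same point to a surrogate pair and four bytes respectively.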

#2 You can mechanically apply the UTF-8 encoding algorithm to the surrogate codepoints, but the resulting bytes are not valid UTF-8. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or discard the bytes and insert a replacement character (�).
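
Python's strict UTF-8 codec behaves the way a compliant reader is described here, so it makes a convenient demonstration. In this sketch the helper is_valid_utf8 is a made-up name, and the bytes ED B0 80 are what the UTF-8 bit layout produces for the lone low surrogate DC00.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_surrogate = "\udc00"

# The strict encoder refuses an unpaired surrogate outright.
try:
    lone_surrogate.encode("utf-8")
    encoder_refused = False
except UnicodeEncodeError:
    encoder_refused = True
print("encoder refused:", encoder_refused)

# Forcing the algorithm through (Python's "surrogatepass" error handler)
# yields bytes that a strict decoder then rejects.
forced = lone_surrogate.encode("utf-8", "surrogatepass")
print(forced.hex(), "valid UTF-8?", is_valid_utf8(forced))

# A lenient reader substitutes U+FFFD instead of raising.
print(forced.decode("utf-8", errors="replace"))
```

This is also a plausible reason why a regex engine that validates its input as UTF-8 would reject a character class containing DC00-DFFF, although the exact check PHP performs is not shown in the question.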

#3 Java provides a way of implementing functions in external code through a system called JNI. The Java String API exposes String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are passed as Modified UTF-8. Modified UTF-8 is essentially the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints (it also encodes U+0000 as two bytes), so a surrogate pair becomes two 3-byte sequences rather than one 4-byte sequence.
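
To make the difference from standard UTF-8 concrete, here is a rough Python sketch of that idea: take the UTF-16 code units and run each one through the UTF-8 bit layout separately. The function name encode_modified_utf8 is hypothetical, the input is assumed to contain no unpaired surrogates, and the special case for U+0000 (written as C0 80) follows the Modified UTF-8 description in the JNI documentation.

```python
def encode_modified_utf8(text):
    """Sketch: apply the UTF-8 bit layout to each UTF-16 code unit of text.

    Hypothetical helper, not a Java or Python API.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"           # U+0000 is written as two bytes
        else:
            # "surrogatepass" lets surrogate code units through the encoder.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


text = "\U0001F600"
print("standard UTF-8:", text.encode("utf-8").hex())
print("modified UTF-8:", encode_modified_utf8(text).hex())
```

For text without U+0000 or codepoints above U+FFFF the two outputs are byte-for-byte identical, which is why the difference often goes unnoticed until such a character shows up.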

Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be treated as text, you must know the encoding; otherwise, you risk data loss.
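
A small Python illustration of that rule, using an arbitrary sample string: the same bytes round-trip cleanly with the right encoding, turn into different characters with a wrong one, and lose information when a lenient decoder substitutes replacement characters.

```python
original = "h\u00e9llo"                  # "héllo"
written = original.encode("utf-8")       # the writer chose UTF-8

# Reading with the encoding used to write recovers the text exactly.
round_trip = written.decode("utf-8")

# Reading with a different encoding silently produces different characters.
wrong = written.decode("latin-1")

# A lenient ASCII reader drops the information entirely.
lossy = written.decode("ascii", errors="replace")

print(round_trip == original)
print(repr(wrong))
print(repr(lossy))
```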
