Is ED A0 80 ED B0 80 a valid UTF-8 byte sequence?
Question
java.nio.charset.Charset.forName("utf8").decode decodes the byte sequence
ED A0 80 ED B0 80
into the Unicode code point:
U+10000
java.nio.charset.Charset.forName("utf8").decode also decodes the byte sequence
F0 90 80 80
into the Unicode code point:
U+10000
This is verified by the code below.
Now this seems to be telling me that the UTF-8 encoding scheme will decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point.
However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80, I can see that it is clearly different from the page https://www.google.com/search?query=%F0%90%80%80
Since Google Search uses the UTF-8 encoding scheme as well (correct me if I'm wrong), this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point(s).
So basically I was wondering: by the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80 into the Unicode code point U+10000?
Code:
public class Test {
    public static void main(String args[]) {
        // Decode ED A0 80 ED B0 80 and print each resulting UTF-16 code unit in hex.
        java.nio.ByteBuffer bb = java.nio.ByteBuffer.wrap(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        java.nio.CharBuffer cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
        System.out.println();
        // Decode F0 90 80 80 the same way for comparison.
        bb = java.nio.ByteBuffer.wrap(new byte[] { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
        cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
    }
}
Recommended answer
ED A0 80 ED B0 80 is what you get when each half of the UTF-16 surrogate pair D800 DC00 is encoded separately as a three-byte UTF-8-style sequence. This is NOT allowed in UTF-8:
"However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above."
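To make "undoing the UTF-16 transformation" concrete for these particular bytes, here is a minimal sketch of the arithmetic. The class and method names (SurrogateUndo, decode3, encode4) are made up for this illustration, and the code does no input validation, so treat it as a worked example rather than a real decoder.

```java
class SurrogateUndo {
    // Unpack one three-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx
    // into its 16-bit payload. No validation is done (illustration only).
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // Encode a supplementary code point (U+10000..U+10FFFF) as four UTF-8 bytes.
    static int[] encode4(int cp) {
        return new int[] {
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)
        };
    }

    public static void main(String[] args) {
        char high = (char) decode3(0xED, 0xA0, 0x80); // D800, a high surrogate
        char low = (char) decode3(0xED, 0xB0, 0x80);  // DC00, a low surrogate
        System.out.printf("%04X %04X%n", (int) high, (int) low);

        // Both values are surrogates, so neither is a code point UTF-8 may encode.
        System.out.println(Character.isSurrogate(high) && Character.isSurrogate(low));

        // Undo the UTF-16 transformation, then encode the real code point.
        int cp = Character.toCodePoint(high, low);
        System.out.printf("U+%X%n", cp);
        for (int b : encode4(cp)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

The last line printed is the four-byte form F0 90 80 80, which is the only well-formed UTF-8 encoding of U+10000.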
That said, this surrogate-by-surrogate encoding is used in CESU-8 and in Java's "Modified UTF-8".
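One way to see the difference from within Java is to compare String.getBytes with StandardCharsets.UTF_8 against DataOutputStream.writeUTF, which is documented to write Modified UTF-8 preceded by a two-byte length. The sketch below is only an illustration; the class name ModifiedUtf8Demo and its helper methods are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Format bytes as space-separated uppercase hex, e.g. "F0 90 80 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: a supplementary code point becomes one four-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF:
    // each UTF-16 surrogate is encoded on its own as three bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF prefixes the data with a two-byte length; drop it here.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10000));
        System.out.println(hex(standardUtf8(s))); // F0 90 80 80
        System.out.println(hex(modifiedUtf8(s))); // ED A0 80 ED B0 80
    }
}
```

Bytes produced this way are fine for DataInputStream.readUTF to read back, but they should not be labelled or transmitted as UTF-8.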