4 byte unicode character in Java
Problem description
I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
You need to do this:
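// requires: import java.nio.charset.StandardCharsets;
final char[] chars = Character.toChars(0x1F701);            // the two UTF-16 code units for U+1F701
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);  // 0xf0 0x9f 0x9c 0x81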
When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... Code points outside the BMP need two chars: a leading surrogate and a trailing surrogate (Java calls these a high and a low surrogate, respectively). There is no character literal in Java that lets you enter a code point outside the BMP directly.
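To make the surrogate split concrete, here is a minimal sketch of the arithmetic. The class and method names (SurrogateDemo, toSurrogates) are made up for illustration; the JDK helpers Character.highSurrogate() and Character.lowSurrogate() are used only as a cross-check.

```java
public class SurrogateDemo {
    // Hypothetical helper: splits a supplementary code point into its
    // UTF-16 surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        // prints "D83D DF01"
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own helpers; both print "true".
        System.out.println(pair[0] == Character.highSurrogate(0x1F701));
        System.out.println(pair[1] == Character.lowSurrogate(0x1F701));
    }
}
```

In real code, Character.toChars() does the same job and is the simpler choice; the manual version is only there to show where the two chars come from.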
Given that a char is, in fact, a UTF-16 code unit and that string literals accept escapes for these, you can input this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment has support for it.
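For a unit test, it may help to check that the escaped literal really is the same string as the one built from the code point, and that it encodes to the four bytes from the question. A small sketch follows; LiteralDemo and its hex() helper are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class LiteralDemo {
    // U+1F701 written as two UTF-16 escapes.
    static final String ESCAPED = "\uD83D\uDF01";

    // Hypothetical helper: renders bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String built = new String(Character.toChars(0x1F701));
        System.out.println(ESCAPED.equals(built));  // true
        System.out.println(ESCAPED.length());       // 2 (chars, not code points)
        // prints "f0 9f 9c 81"
        System.out.println(hex(ESCAPED.getBytes(StandardCharsets.UTF_8)));
    }
}
```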
See also the CharsetDecoder and CharsetEncoder classes.
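One reason these classes can matter in tests: an encoder obtained from Charset.newEncoder() reports malformed input by default, so an unpaired surrogate raises a CharacterCodingException instead of being silently replaced. The sketch below assumes that default behaviour; StrictCodecDemo and its methods are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StrictCodecDemo {
    // Encodes to UTF-8, throwing on malformed input such as a lone surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Decodes UTF-8, throwing on malformed byte sequences.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    // True if the string survives a strict encode/decode round trip unchanged.
    static boolean roundTrips(String s) {
        try {
            return decodeStrict(encodeStrict(s)).equals(s);
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("\uD83D\uDF01"));  // true: a valid pair
        System.out.println(roundTrips("\uD83D"));        // false: lone high surrogate
    }
}
```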
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
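As a rough illustration of why these two methods are useful here, the sketch below compares length() with the code point count for a string containing U+1F701. CodePointDemo and codePointLength are names invented for the example.

```java
public class CodePointDemo {
    // Hypothetical helper: number of Unicode code points in the whole string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F701)) + "b";
        System.out.println(s.length());          // 4 (UTF-16 code units)
        System.out.println(codePointLength(s));  // 3 (code points)
        // Java 8+: iterate code points rather than chars.
        // prints U+0061, U+1F701, U+0062, one per line
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

A custom string type that reports length in characters would likely want the second number, not the first.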