What does it mean to say "Java Modified UTF-8 Encoding"?

Question

What does it mean to say "Java Modified UTF-8 Encoding"? How is it different from normal UTF-8 encoding?

Recommended answer

This is described in detail in the javadoc of DataInput:


Modified UTF-8

Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. (For information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0). Note that in the following tables, the most significant bit appears in the far left-hand column.

... (some tables omitted; click the javadoc link to see them for yourself) ...

The differences between this format and the standard UTF-8 format are the following:


  • The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
  • Only the 1-byte, 2-byte, and 3-byte formats are used.
  • Supplementary characters are represented in the form of surrogate pairs.
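
A minimal sketch of how the first and third differences show up in practice, assuming the JDK's DataOutputStream as the DataOutput implementation. The class and helper names (ModifiedUtf8Differences, modifiedUtf8, hex) are made up for illustration; the helper drops the two-byte length prefix that writeUTF emits so that only the encoded characters are compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Differences {

    // Hypothetical helper: encodes s with writeUTF and strips the 2-byte
    // length prefix, leaving only the modified UTF-8 payload.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        // U+1F600 is a supplementary character: one code point, two Java chars.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));                              // C0 80
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));                    // ED A0 BD ED B8 80
    }
}
```

Standard UTF-8 uses the 4-byte format for the supplementary character, while the modified form encodes each half of the surrogate pair as its own 3-byte group, which is why the second difference (only 1-, 2-, and 3-byte formats) follows from the third.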

How to read it is described in detail in the javadoc of DataInput#readUTF():


readUTF

String readUTF()
           throws IOException

Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.

First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
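
To make the UTF length concrete, here is a small sketch that writes a string with writeUTF and reads back only the two-byte prefix with readUnsignedShort. The class and method names (UtfLengthPrefix, utfLength) are illustrative assumptions, not part of the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class UtfLengthPrefix {

    // Hypothetical helper: returns the UTF length that writeUTF stored
    // in front of the encoded form of s.
    static int utfLength(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readUnsignedShort();
        }
    }

    public static void main(String[] args) throws IOException {
        // 'h', 'l', 'l', 'o' take one byte each; U+00E9 takes two.
        System.out.println(utfLength("h\u00e9llo")); // 6
        // The NUL character takes two bytes in modified UTF-8.
        System.out.println(utfLength("\u0000"));     // 2
    }
}
```

The UTF length counts encoded bytes, not characters, and because it is an unsigned 16-bit value a single writeUTF call can carry at most 65535 encoded bytes.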

If the first byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 1"), then the group consists of just that byte. The byte is zero-extended to form a character.

If the first byte of a group matches the bit pattern 110xxxxx, then the group consists of that byte a and a second byte b. If there is no byte b (because byte a was the last of the bytes to be read), or if byte b does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

(char)(((a & 0x1F) << 6) | (b & 0x3F))

If the first byte of a group matches the bit pattern 1110xxxx, then the group consists of that byte a and two more bytes b and c. If there is no byte c (because byte a was one of the last two of the bytes to be read), or either byte b or byte c does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

(char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))
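
As a worked example, the two conversion expressions from the javadoc can be wrapped in small methods and applied to concrete bytes. The class and method names (GroupDecoding, decodeTwo, decodeThree) are assumptions for this sketch, and it skips the validity checks on b and c described above.

```java
public class GroupDecoding {

    // 2-byte group: 110xxxxx 10xxxxxx
    static char decodeTwo(int a, int b) {
        return (char) (((a & 0x1F) << 6) | (b & 0x3F));
    }

    // 3-byte group: 1110xxxx 10xxxxxx 10xxxxxx
    static char decodeThree(int a, int b, int c) {
        return (char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F));
    }

    public static void main(String[] args) {
        // C3 A9: (0x03 << 6) | 0x29 = 0xE9
        System.out.printf("U+%04X%n", (int) decodeTwo(0xC3, 0xA9));         // U+00E9
        // C0 80: the 2-byte form of the NUL character
        System.out.printf("U+%04X%n", (int) decodeTwo(0xC0, 0x80));         // U+0000
        // E2 82 AC: (0x2 << 12) | (0x02 << 6) | 0x2C = 0x20AC
        System.out.printf("U+%04X%n", (int) decodeThree(0xE2, 0x82, 0xAC)); // U+20AC
    }
}
```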

If the first byte of a group matches the pattern 1111xxxx or the pattern 10xxxxxx, then a UTFDataFormatException is thrown.

If end of file is encountered at any time during this entire process, then an EOFException is thrown.

After every group has been converted to a character by this process, the characters are gathered, in the same order in which their corresponding groups were read from the input stream, to form a String, which is returned.
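
The whole procedure can be sketched as a hand-written decoder. This is an illustration of the steps quoted above under stated assumptions, not the JDK's actual implementation; the names ModifiedUtf8Reader, readModifiedUtf8, and decode are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class ModifiedUtf8Reader {

    static String readModifiedUtf8(DataInput in) throws IOException {
        int utfLength = in.readUnsignedShort();   // the two-byte UTF length
        byte[] bytes = new byte[utfLength];
        in.readFully(bytes);                      // EOFException if the input ends early
        StringBuilder chars = new StringBuilder(utfLength);
        int i = 0;
        while (i < utfLength) {
            int a = bytes[i] & 0xFF;
            if ((a & 0x80) == 0x00) {             // 0xxxxxxx: single byte
                chars.append((char) a);
                i += 1;
            } else if ((a & 0xE0) == 0xC0) {      // 110xxxxx: needs byte b
                if (i + 1 >= utfLength) {
                    throw new UTFDataFormatException("missing second byte");
                }
                int b = bytes[i + 1] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("bad continuation byte");
                }
                chars.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if ((a & 0xF0) == 0xE0) {      // 1110xxxx: needs bytes b and c
                if (i + 2 >= utfLength) {
                    throw new UTFDataFormatException("missing third byte");
                }
                int b = bytes[i + 1] & 0xFF;
                int c = bytes[i + 2] & 0xFF;
                if ((b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("bad continuation byte");
                }
                chars.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            } else {                              // 1111xxxx or 10xxxxxx
                throw new UTFDataFormatException("bad first byte of group");
            }
        }
        return chars.toString();
    }

    // Convenience wrapper for a byte array that starts with the length prefix.
    static String decode(byte[] framed) throws IOException {
        return readModifiedUtf8(new DataInputStream(new ByteArrayInputStream(framed)));
    }

    public static void main(String[] args) throws IOException {
        // Length 6, then 'A', C3 A9 (U+00E9), E2 82 AC (U+20AC).
        byte[] framed = {0x00, 0x06, 0x41, (byte) 0xC3, (byte) 0xA9,
                         (byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(decode(framed).equals("A\u00e9\u20ac")); // true
    }
}
```

Reading the whole payload with readFully before decoding keeps the group logic simple; a streaming decoder could read one group at a time instead.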

The writeUTF method of interface DataOutput may be used to write data that is suitable for reading by this method.
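
A short round trip, assuming the JDK's DataOutputStream and DataInputStream as the DataOutput and DataInput implementations. The class and method names (WriteUtfRoundTrip, roundTrip) are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfRoundTrip {

    // Writes s with writeUTF into memory, then reads it back with readUTF.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // Covers an embedded NUL, a 3-byte character, and a supplementary character.
        String original = "nul:\u0000 euro:\u20ac emoji:"
                + new String(Character.toChars(0x1F600));
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```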
