Calculating length in UTF-8 of Java String without actually encoding it
Question
Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:
"some really long string".getBytes("UTF-8").length
I need to calculate a length prefix for potentially long serialized messages.
Recommended answer
Here's an implementation based on the UTF-8 specification:
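public class Utf8LenCounter {
    /** Returns the number of bytes needed to encode the sequence as UTF-8. */
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;        // U+0000..U+007F: 1 byte
            } else if (ch <= 0x7FF) {
                count += 2;     // U+0080..U+07FF: 2 bytes
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;     // surrogate pair (U+10000 and above): 4 bytes
                ++i;            // skip the low surrogate that completes the pair
            } else {
                count += 3;     // remaining BMP characters: 3 bytes
            }
        }
        return count;
    }
}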
This implementation is not tolerant of malformed strings: it assumes every high surrogate is followed by a low surrogate, so unpaired surrogates are miscounted.
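If malformed input is a possibility, the counter can be made lenient. The following is a minimal sketch, not part of the original answer: the class name LenientUtf8Length is hypothetical, and it assumes the goal is to match String.getBytes with UTF-8, which substitutes the single byte '?' for an unpaired surrogate. It also returns a long so that very long inputs cannot overflow the count.

```java
final class LenientUtf8Length {
    /**
     * Counts the UTF-8 bytes of the sequence. An unpaired surrogate is
     * counted as one byte (assumption: it will be replaced by '?', as
     * String.getBytes does for malformed input).
     */
    static long length(CharSequence sequence) {
        long count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                count += 4; // well-formed surrogate pair
                i++;        // consume the low surrogate
            } else if (Character.isSurrogate(ch)) {
                count++;    // unpaired surrogate, counted as the 1-byte replacement
            } else {
                count += 3;
            }
        }
        return count;
    }
}
```

If the eventual encoder is configured to reject malformed input instead of replacing it, throwing an exception from the unpaired-surrogate branch would be the more fitting choice.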
Here's a JUnit 4 test for verification:
import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
    @Test public void testUtf8Len() {
        Charset utf8 = Charset.forName("UTF-8");
        AllCodepointsIterator iterator = new AllCodepointsIterator();
        while (iterator.hasNext()) {
            String test = new String(Character.toChars(iterator.next()));
            Assert.assertEquals(test.getBytes(utf8).length,
                    Utf8LenCounter.length(test));
        }
    }

    /** Iterates over every defined code point, skipping the surrogate range. */
    private static class AllCodepointsIterator {
        private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
        private static final int SURROGATE_FIRST = 0xD800;
        private static final int SURROGATE_LAST = 0xDFFF;
        private int codepoint = 0;
        public boolean hasNext() { return codepoint < MAX; }
        public int next() {
            int ret = codepoint;
            codepoint = next(codepoint);
            return ret;
        }
        private int next(int codepoint) {
            while (codepoint++ < MAX) {
                if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
                if (!Character.isDefined(codepoint)) { continue; }
                return codepoint;
            }
            return MAX;
        }
    }
}
Please excuse the compact formatting.
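For the length-prefix use case in the question, the count can be written first and the message then streamed through an encoder, so the full byte array is never built. This is only a sketch under stated assumptions: the class name LengthPrefixedWriter, the write method, and the 4-byte big-endian prefix format are all hypothetical choices, and the input is assumed to be well-formed UTF-16.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedWriter {
    // Same counting rule as the answer; assumes well-formed surrogate pairs.
    static int utf8Length(CharSequence s) {
        int count = 0;
        for (int i = 0, len = s.length(); i < len; i++) {
            char ch = s.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;
                ++i;
            } else {
                count += 3;
            }
        }
        return count;
    }

    /** Writes a 4-byte big-endian length prefix followed by the UTF-8 bytes. */
    static void write(OutputStream out, String message) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(utf8Length(message)); // hypothetical prefix format
        // OutputStreamWriter encodes through a small internal buffer
        // rather than materializing the whole encoded message at once.
        Writer writer = new OutputStreamWriter(data, StandardCharsets.UTF_8);
        writer.write(message);
        writer.flush(); // push the buffered bytes through; the stream stays open
    }
}
```

The prefix and the payload only agree when the counter and the encoder treat the input the same way, which is why the well-formedness assumption matters here.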