Efficient way to calculate byte length of a character, depending on the encoding


Problem description


What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is only known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be determined individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find another, more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as above. I imagine this could be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
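
For UTF-8 specifically, the byte count can indeed be derived from the code point value alone, without encoding anything; a minimal sketch (assuming you already have a complete code point, i.e. surrogate pairs already combined, and the method name is just for illustration):

// UTF-8 byte length of a single Unicode code point (sketch).
// Assumes a valid code point in 0x0000..0x10FFFF with surrogates already combined.
static int utf8ByteLength(int codePoint) {
    if (codePoint < 0x80)    return 1; // U+0000..U+007F (ASCII)
    if (codePoint < 0x800)   return 2; // U+0080..U+07FF
    if (codePoint < 0x10000) return 3; // U+0800..U+FFFF (rest of the BMP)
    return 4;                          // U+10000..U+10FFFF (supplementary planes)
}

This only helps because the byte-length table for UTF-8 is fixed; for an arbitrary encoding known only at runtime you still need a CharsetEncoder, as in the update below.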

If you question the need for this, check this topic.


Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. For example, there may be combining/surrogate characters which need to be taken into account as well. But that's a separate problem which needs to be solved in a step before this one.
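
As an aside, the surrogate-pair part of that earlier step can be handled by walking the text per code point rather than per char; a small sketch using the standard Character/String code point methods (getStringSomehow() is hypothetical, mirroring getCharSomehow() above):

// Walk a string code point by code point so a surrogate pair is treated as one unit.
String s = getStringSomehow(); // hypothetical source, analogous to getCharSomehow()
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);               // joins a high/low surrogate pair
    int charsUsed = Character.charCount(codePoint); // 1 for the BMP, 2 for supplementary planes
    // measure the byte length of this code point here
    i += charsUsed;
}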

Solution

Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters (a 10,000 × 10,000 loop):

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
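
Outside a benchmark, the same reuse idea can be wrapped in a small helper that allocates the encoder and both buffers once; a sketch under the assumption that each char is a complete character on its own (class and method names are mine, not from the answer):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Sketch: one encoder, one input CharBuffer and one output ByteBuffer, recycled per call.
class CharByteCounter {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array);
    private final ByteBuffer output = ByteBuffer.allocate(16); // roomy enough for any single char

    CharByteCounter(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    int byteLength(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();                     // each call is an independent encoding operation
        encoder.encode(input, output, true); // endOfInput = true: nothing more is coming
        encoder.flush(output);               // emit any trailing bytes a stateful charset needs
        return output.position();
    }
}

Something like new CharByteCounter(Charset.forName(encoding)) could then be created once and its byteLength(c) called inside the loop from the question.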

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
    input.limit(Math.max(input.position() + 2, input.capacity()));
    if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
        //Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
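
If it helps to see the idea end to end, the loop can also be folded into one self-contained method that sums the per-code-point lengths; the method name, the REPLACE error policy (so a lone surrogate is substituted rather than stalling the encoder) and the per-window reset/flush are my own assumptions, not part of the answer:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Sketch: total encoded length of a CharSequence, measured one code point at a time.
static int encodedLength(CharSequence text, Charset charset) {
    CharsetEncoder encoder = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    CharBuffer input = CharBuffer.wrap(text);
    ByteBuffer output = ByteBuffer.allocate(16);

    int total = 0;
    int limit = input.limit();
    while (input.position() < limit) {
        output.clear();
        input.mark();
        input.limit(Math.min(input.position() + 2, limit)); // window: at most one surrogate pair
        if (Character.isHighSurrogate(input.get()) && input.hasRemaining()
                && Character.isLowSurrogate(input.get(input.position()))) {
            input.get(); // include the low surrogate so the pair is encoded together
        }
        input.limit(input.position()); // shrink the window to what was just read
        input.reset();                 // rewind to the start of the window
        encoder.reset();
        encoder.encode(input, output, true); // encode this window as a stand-alone unit
        encoder.flush(output);
        total += output.position();          // bytes for this code point
        input.position(input.limit());       // always step past the window
    }
    return total;
}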
