Efficient way to calculate byte length of a character, depending on the encoding


Problem description


What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is only known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be determined individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find another, more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as above. I imagine this could be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
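
For UTF-8 specifically, the byte count can indeed be derived from the code point value alone, without encoding anything; a minimal sketch (assuming you already have a complete code point, i.e. surrogate pairs already combined, and the method name is just for illustration):

// UTF-8 byte length of a single Unicode code point (sketch).
// Assumes a valid code point in 0x0000..0x10FFFF with surrogates already combined.
static int utf8ByteLength(int codePoint) {
    if (codePoint < 0x80)    return 1; // U+0000..U+007F (ASCII)
    if (codePoint < 0x800)   return 2; // U+0080..U+07FF
    if (codePoint < 0x10000) return 3; // U+0800..U+FFFF (rest of the BMP)
    return 4;                          // U+10000..U+10FFFF (supplementary planes)
}

This only helps because the byte-length table for UTF-8 is fixed; for an arbitrary encoding known only at runtime you still need a CharsetEncoder, as in the update below.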

If you question the need for this, check this topic.


Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. For example, there may be combining/surrogate characters which need to be taken into account as well. But that's a separate problem which needs to be solved in a step before this one.
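
As an aside, the surrogate-pair part of that earlier step can be handled by walking the text per code point rather than per char; a small sketch using the standard Character/String code point methods (getStringSomehow() is hypothetical, mirroring getCharSomehow() above):

// Walk a string code point by code point so a surrogate pair is treated as one unit.
String s = getStringSomehow(); // hypothetical source, analogous to getCharSomehow()
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);               // joins a high/low surrogate pair
    int charsUsed = Character.charCount(codePoint); // 1 for the BMP, 2 for supplementary planes
    // measure the byte length of this code point here
    i += charsUsed;
}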

Solution

Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters (a 10,000 × 10,000 loop):

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
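
Outside a benchmark, the same reuse idea can be wrapped in a small helper that allocates the encoder and both buffers once; a sketch under the assumption that each char is a complete character on its own (class and method names are mine, not from the answer):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Sketch: one encoder, one input CharBuffer and one output ByteBuffer, recycled per call.
class CharByteCounter {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array);
    private final ByteBuffer output = ByteBuffer.allocate(16); // roomy enough for any single char

    CharByteCounter(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    int byteLength(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();                     // each call is an independent encoding operation
        encoder.encode(input, output, true); // endOfInput = true: nothing more is coming
        encoder.flush(output);               // emit any trailing bytes a stateful charset needs
        return output.position();
    }
}

Something like new CharByteCounter(Charset.forName(encoding)) could then be created once and its byteLength(c) called inside the loop from the question.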

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
    input.limit(Math.max(input.position() + 2, input.capacity()));
    if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
        //Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
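
If it helps to see the idea end to end, the loop can also be folded into one self-contained method that sums the per-code-point lengths; the method name, the REPLACE error policy (so a lone surrogate is substituted rather than stalling the encoder) and the per-window reset/flush are my own assumptions, not part of the answer:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Sketch: total encoded length of a CharSequence, measured one code point at a time.
static int encodedLength(CharSequence text, Charset charset) {
    CharsetEncoder encoder = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    CharBuffer input = CharBuffer.wrap(text);
    ByteBuffer output = ByteBuffer.allocate(16);

    int total = 0;
    int limit = input.limit();
    while (input.position() < limit) {
        output.clear();
        input.mark();
        input.limit(Math.min(input.position() + 2, limit)); // window: at most one surrogate pair
        if (Character.isHighSurrogate(input.get()) && input.hasRemaining()
                && Character.isLowSurrogate(input.get(input.position()))) {
            input.get(); // include the low surrogate so the pair is encoded together
        }
        input.limit(input.position()); // shrink the window to what was just read
        input.reset();                 // rewind to the start of the window
        encoder.reset();
        encoder.encode(input, output, true); // encode this window as a stand-alone unit
        encoder.flush(output);
        total += output.position();          // bytes for this code point
        input.position(input.limit());       // always step past the window
    }
    return total;
}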
