Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

Problem description

I know this is a very general question, but it's driving me mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;

But I read that it is better to use CharsetDecoder and CharsetEncoder (my content has some characters that are probably outside the destination encoding). I've just written this code, but it has some problems:

// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();

Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();

This code appends a sequence of null characters to the buffer!

Could someone tell me where the problem is? I'm not very experienced with encoding conversion in Java.

Is there a better way to convert encodings in Java?

Solution

Your problem is that ByteBuffer.array() returns a direct reference to the array used as the backing store for the ByteBuffer, not a copy of the backing array's valid range. You have to respect bbuf.limit() (as Peter did in his response) and use only the array content from index 0 to bbuf.limit()-1.
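
As a rough sketch of that fix, the conversion from the question could copy out only the encoded bytes instead of returning the backing array. The class name CharsetConversionSketch and the method convert are made up for illustration, and wrapping the checked CharacterCodingException in an UncheckedIOException is just one possible choice, not something from the original post:

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

// Hypothetical helper class; the names are not from the original post.
class CharsetConversionSketch {

    static byte[] convert(byte[] bufferToConvert, String inputEncoding, String outputEncoding) {
        try {
            // Decode the input bytes into Java's internal char representation.
            CharBuffer cbuf = Charset.forName(inputEncoding).newDecoder()
                    .decode(ByteBuffer.wrap(bufferToConvert));
            // Encode the chars into the output charset.
            ByteBuffer bbuf = Charset.forName(outputEncoding).newEncoder().encode(cbuf);

            // Copy only the bytes between position() and limit(); the array
            // returned by array() may be longer than the encoded content.
            byte[] outputBuf = new byte[bbuf.remaining()];
            bbuf.get(outputBuf);
            return outputBuf;
        } catch (CharacterCodingException e) {
            // Assumption: surface malformed or unmappable input as an unchecked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café €" in windows-1252: 0xE9 is 'é' and 0x80 is '€'.
        byte[] input = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9, (byte) 0x20, (byte) 0x80};
        byte[] output = convert(input, "windows-1252", "UTF-8");
        System.out.println(output.length); // prints 9
    }
}
```

Using bbuf.remaining() together with bbuf.get(...) avoids touching array() at all, so it would also work if the buffer had a non-zero position or array offset.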

The extra 0 values in the backing array come from a slight flaw in how the CharsetEncoder creates the resulting ByteBuffer. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems simple and correct (2 bytes/char). Using this fixed value, the CharsetEncoder initially allocates a ByteBuffer of "string length * average bytes per character" bytes, e.g. 20 bytes for a 10-character string. However, the UCS2 CharsetEncoder starts its output with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are needed to encode the remaining character, the array you get from bbuf.array() has a length of 41 bytes, but bbuf.limit() indicates that only the first 22 entries are actually used.
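
The difference between the two sizes can be observed with a small experiment like the one below. It is only a sketch: the class name BackingArrayDemo and the method encodeUtf16 are made up for illustration, and Java's "UTF-16" charset is assumed to be what the answer calls the UCS2 encoder. The exact length of the backing array depends on the JDK's internal allocation strategy, so only limit() is given as an expected value here:

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from the original post.
class BackingArrayDemo {

    static ByteBuffer encodeUtf16(String text) {
        try {
            // The UTF-16 encoder writes a 2-byte BOM before the characters.
            return StandardCharsets.UTF_16.newEncoder().encode(CharBuffer.wrap(text));
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer bbuf = encodeUtf16("0123456789"); // 10 characters

        // 2 bytes of BOM + 10 * 2 bytes of character data = 22 valid bytes.
        System.out.println("limit()        = " + bbuf.limit());

        // Size of the backing array; may be larger than limit()
        // (the explanation above predicts 41).
        System.out.println("array().length = " + bbuf.array().length);
    }
}
```

Any bytes in array() at index limit() or beyond are unused padding, which is where the trailing null characters in the question come from.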
