How do I encode/decode UTF-16LE byte arrays with a BOM?


Question

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to encode byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should also work in big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: every Java platform must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
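For illustration, a decode along those lines might look like the sketch below. It assumes the input is little endian and may or may not start with the FF FE BOM. The class and method names are hypothetical, and it uses the Charset overload of the constructor rather than the charsetName overload quoted above, so no checked exception is involved.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeSketch {

    // Hypothetical helper: decodes UTF-16LE bytes, skipping a leading
    // FF FE byte order mark if one is present.
    public static String decodeString(byte[] data) {
        boolean hasLeBom = data.length >= 2
                && data[0] == (byte) 0xFF
                && data[1] == (byte) 0xFE;
        int offset = hasLeBom ? 2 : 0;
        // The offset/length constructor reads the array in place,
        // so the BOM is skipped without copying the input first.
        return new String(data, offset, data.length - offset,
                          StandardCharsets.UTF_16LE);
    }
}
```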


Recommended answer

The "UTF-16" charset name always encodes with a BOM and decodes data of either endianness, while "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM; see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical naming of charset string names or (preferably) use the Charset class. Also note that only a limited subset of encodings is absolutely required to be supported.
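One way to read the "\uFEFF" suggestion is sketched below: prepend the BOM character to the string and encode with UTF-16LE, so the encoder itself emits the FF FE prefix and no second byte array is needed. The class and method names are hypothetical. The string concatenation still makes a temporary copy of the characters, so this only avoids the byte-array copy, not all copying.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeBom {

    // Encodes as UTF-16LE with a leading BOM. U+FEFF encoded in
    // little endian becomes the bytes FF FE.
    public static byte[] encode(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    // Decodes with the "UTF-16" charset, which reads the BOM to pick
    // the byte order and does not include it in the result.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encode("hi");
        System.out.println(encoded.length);
        System.out.println(decode(encoded));
    }
}
```

Decoding with "UTF-16" also accepts big-endian input that carries an FE FF BOM, which may or may not be desirable when the peer is expected to send little endian only.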
