How can I check whether a byte array contains a Unicode string in Java?

Question

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?

The array may be generated by code similar to:

byte[] utf8 = "Hello World".getBytes(StandardCharsets.UTF_8); // requires: import java.nio.charset.StandardCharsets;

Alternatively it may have been generated by code similar to:

byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:

public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}

How would this be extended to also cover UTF-16 or other encoding mechanisms?

Recommended answer

It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is itself one kind of arbitrary binary data. What you can do is look for byte sequences that are invalid in UTF-8. If you find any, you know that the array is not UTF-8.
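One way to look for invalid byte sequences is to run the array through a strict `CharsetDecoder` that reports malformed input instead of silently replacing it. The following is a minimal sketch under that assumption, not a definitive implementation. The class name `Utf8Sniffer` and the helper `tryDecodeUtf8` are hypothetical names introduced for illustration; `getString` mirrors the function from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, named here only for illustration.
final class Utf8Sniffer {

    /** Returns the decoded text if the bytes are well-formed UTF-8, otherwise null. */
    static String tryDecodeUtf8(byte[] data) {
        // REPORT makes the decoder throw on invalid sequences
        // instead of substituting U+FFFD replacement characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null; // at least one byte sequence is not valid UTF-8
        }
    }

    /** Decodes well-formed UTF-8; Base64-encodes everything else. */
    static String getString(byte[] dataToProcess) {
        String text = tryDecodeUtf8(dataToProcess);
        return text != null ? text : Base64.getEncoder().encodeToString(dataToProcess);
    }
}
```

Note that this only proves the bytes are well-formed UTF-8. Pure ASCII binary data, including control characters such as 0x00, passes the check as well.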

If your array is large enough, this should work out well, since such invalid sequences are very likely to appear in "random" binary data such as compressed data or image files.

However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
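The closer analysis could be sketched roughly as follows. This version swaps in `Character.UnicodeScript` as a stand-in for "code chart", which is an assumption on my part: Unicode blocks (`Character.UnicodeBlock`) are closer to the literal code charts, but accented Latin letters already span several blocks. The class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper class, named here only for illustration.
final class ScriptCheck {

    /** Collects the Unicode scripts of every letter in the text. */
    static Set<UnicodeScript> letterScripts(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()                       // iterate by code point, not by char
            .filter(Character::isLetter)        // ignore digits, punctuation, spaces
            .forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }

    /** True if all letters share one script (or the text has no letters at all). */
    static boolean usesSingleScript(String text) {
        return letterScripts(text).size() <= 1;
    }
}
```

As the answer warns, a check like this rejects legitimate mixed-script text, so it is probably best treated as one signal among several and not as a hard rule.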
