Check if byte sequence contains UTF-16


Question

I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence has a fixed length and that I read the whole thing into a byte array (in my case a vector<char>, but that is not important for this question). The byte sequence contains a string that may be encoded in either UTF-16 or UTF-8. Unfortunately, nothing indicates which one it is.

I can verify whether the byte sequence is a valid UTF-16 encoding, and also whether it is a valid UTF-8 encoding, but I can also imagine the same sequence of bytes being valid UTF-8 and valid UTF-16 at the same time.

So, does that mean there is no way to determine generically which one it is?

Recommended answer

If the contents are expected to be written in a language that uses the Latin script, simply counting null bytes will detect UTF-16: every ASCII-range character is encoded in UTF-16 as two bytes, one of which is zero. In UTF-8, a null byte decodes to the NUL control character, which does not normally appear in text.
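As a minimal sketch of the null-counting idea, assuming the data is already in a `std::vector<char>` as in the question. The helper names `count_nulls` and `looks_like_utf16_by_nulls` are hypothetical, and the "any null byte" threshold is an assumption that only makes sense for text that is mostly Latin script:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of zero bytes in the buffer.
std::size_t count_nulls(const std::vector<char>& bytes) {
    return static_cast<std::size_t>(
        std::count(bytes.begin(), bytes.end(), '\0'));
}

// Heuristic from the answer: ordinary text contains no NUL characters,
// so any zero byte suggests UTF-16 (assumption: mostly Latin-script text).
bool looks_like_utf16_by_nulls(const std::vector<char>& bytes) {
    return count_nulls(bytes) > 0;
}
```

For example, "Hi" is the two bytes `48 69` in UTF-8 but `48 00 69 00` in little-endian UTF-16, so only the second buffer contains null bytes.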

Text in languages written in other scripts cannot, in practice, be fully valid in both UTF-16 and UTF-8 unless it was artificially constructed to be so.

So, first check whether the sequence is fully valid UTF-8 on its own:

  • If it is, check for null bytes. If there are any, it is UTF-16; otherwise it is UTF-8.
  • If it is not, it is UTF-16.
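The two-step check above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `Encoding`, `is_valid_utf8`, and `guess_encoding` are hypothetical names, the validator is a hand-written strict check (it rejects overlong forms, surrogates, and code points above U+10FFFF), and the function assumes the input really is one of the two encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Encoding { Utf8, Utf16 };  // hypothetical result type

// Strict UTF-8 validation over the whole buffer.
bool is_valid_utf8(const std::vector<char>& bytes) {
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(bytes[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t b = static_cast<std::uint8_t>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Decision procedure from the answer: invalid UTF-8 means UTF-16;
// valid UTF-8 that contains a null byte is also treated as UTF-16.
Encoding guess_encoding(const std::vector<char>& bytes) {
    if (!is_valid_utf8(bytes)) return Encoding::Utf16;
    for (char c : bytes) {
        if (c == '\0') return Encoding::Utf16;
    }
    return Encoding::Utf8;
}
```

Because the procedure is a heuristic, a short buffer such as the bytes `41 42` is reported as UTF-8 ("AB") even though it is also a single valid UTF-16 code unit.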

If the above resulted in UTF-16, that is not enough, because you also have to know the endianness. For languages written in the Latin script, whether the null bytes fall mostly at odd or at even offsets will tell you this.
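One way to sketch the odd/even comparison, assuming 0-based byte offsets and mostly Latin-script text. `ByteOrder` and `guess_utf16_byte_order` are hypothetical names, and returning `Unknown` on a tie is a design choice of this sketch rather than part of the answer:

```cpp
#include <cstddef>
#include <vector>

enum class ByteOrder { LittleEndian, BigEndian, Unknown };  // hypothetical

// For ASCII-range characters the high byte of each UTF-16 code unit is zero.
// Big-endian puts that zero first (even offsets 0, 2, 4, ...);
// little-endian puts it second (odd offsets 1, 3, 5, ...).
ByteOrder guess_utf16_byte_order(const std::vector<char>& bytes) {
    std::size_t even_nulls = 0;
    std::size_t odd_nulls = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] != '\0') continue;
        if (i % 2 == 0) {
            ++even_nulls;
        } else {
            ++odd_nulls;
        }
    }
    if (odd_nulls > even_nulls) return ByteOrder::LittleEndian;
    if (even_nulls > odd_nulls) return ByteOrder::BigEndian;
    return ByteOrder::Unknown;  // no nulls, or an even split
}
```

Text with no null bytes at all, such as purely non-Latin content, gives this heuristic nothing to work with, which is why the sketch reports `Unknown` instead of guessing.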
