Detect encoding of a string in C/C++
Problem description
Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I searched, but most of the samples are written in C#.
Thanks
Recommended answer
Assuming you know the length of the input array, you can make the following guesses:
- First, check whether the first few bytes match any of the well-known byte order marks (BOMs) for Unicode. If they do, you're done!
- Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
- If any character is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
- At this point it is either ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen not to use the top bit and do not contain any null characters.
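The checks above can be sketched roughly as follows. This is only an illustration of the heuristic, not a reliable detector: the names `EncodingGuess`, `Bom`, and `guess_encoding` are hypothetical, and the code sticks to C++03 constructs on the assumption that it has to build with an older compiler such as Visual Studio 2008.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type: each value is a guess, not a guarantee.
enum EncodingGuess {
    GUESS_UTF8_BOM,
    GUESS_UTF16LE_BOM,
    GUESS_UTF16BE_BOM,
    GUESS_UTF32LE_BOM,
    GUESS_UTF32BE_BOM,
    GUESS_PROBABLY_UTF32,   // consecutive '\0' bytes before the last byte
    GUESS_UTF16_OR_UTF32,   // isolated '\0' bytes before the last byte
    GUESS_HIGH_BIT,         // some byte in 0x80..0xff: not ASCII or UTF-7
    GUESS_SEVEN_BIT         // ASCII, UTF-7, Base64, or NUL-free 7-bit UTF-16/32
};

struct Bom {
    EncodingGuess guess;
    unsigned char bytes[4];
    std::size_t size;
};

EncodingGuess guess_encoding(const unsigned char* data, std::size_t len) {
    // Step 1: byte order marks. The UTF-32LE mark (FF FE 00 00) begins with
    // the UTF-16LE mark (FF FE), so the 4-byte marks are tested first.
    static const Bom boms[] = {
        { GUESS_UTF32LE_BOM, { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { GUESS_UTF32BE_BOM, { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { GUESS_UTF8_BOM,    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { GUESS_UTF16LE_BOM, { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
        { GUESS_UTF16BE_BOM, { 0xFE, 0xFF, 0x00, 0x00 }, 2 }
    };
    for (std::size_t b = 0; b < sizeof(boms) / sizeof(boms[0]); ++b) {
        if (len >= boms[b].size &&
            std::memcmp(data, boms[b].bytes, boms[b].size) == 0) {
            return boms[b].guess;
        }
    }

    // Steps 2 and 3: one pass that looks for '\0' bytes before the last byte
    // and for any byte with the top bit set.
    bool has_nul = false;
    bool has_nul_run = false;
    bool has_high = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] >= 0x80) {
            has_high = true;
        }
        if (data[i] == 0 && i + 1 < len) {   // ignore a trailing terminator
            has_nul = true;
            if (i > 0 && data[i - 1] == 0) {
                has_nul_run = true;
            }
        }
    }

    if (has_nul_run) return GUESS_PROBABLY_UTF32;
    if (has_nul)     return GUESS_UTF16_OR_UTF32;
    if (has_high)    return GUESS_HIGH_BIT;

    // Step 4: only 7-bit, NUL-free bytes remain.
    return GUESS_SEVEN_BIT;
}
```

With a `const char*` buffer, the call would look like `guess_encoding(reinterpret_cast<const unsigned char*>(ptr), length)`. A `GUESS_HIGH_BIT` result only rules out ASCII and UTF-7; treating it as UTF-8 is safe only if the input is known to be some variant of Unicode.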