Detect encoding of a string in C/C++
Question
Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I did a search, but most of the samples are in C#.
Thanks
Recommended answer
Assuming you know the length of the input array, you can make the following guesses:
- First, check whether the first few bytes match any well-known Unicode byte order mark (BOM). If they do, you're done!
- Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
- If any character is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
- At this point it is either ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen not to use the top bit and do not have any null characters.
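The guesses above can be sketched roughly as follows. This is only an illustration of the heuristic, not a reliable detector: the enum, its labels, and the function names guess_encoding and starts_with are hypothetical and do not come from any library, and the code assumes the caller passes the buffer as unsigned bytes together with its length. It is written in a plain style that should not depend on newer language features.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result labels for this sketch; not part of any library.
enum EncodingGuess {
    GUESS_UTF8_BOM,
    GUESS_UTF16_LE_BOM,
    GUESS_UTF16_BE_BOM,
    GUESS_UTF32_LE_BOM,
    GUESS_UTF32_BE_BOM,
    GUESS_UTF32,             // no BOM, consecutive zero bytes found
    GUESS_UTF16_OR_UTF32,    // no BOM, isolated zero bytes found
    GUESS_UTF8_OR_MULTIBYTE, // no zero bytes, some byte in 0x80..0xff
    GUESS_SEVEN_BIT          // ASCII, UTF-7, Base64, ...
};

// True if the buffer begins with the given byte sequence.
static bool starts_with(const unsigned char* data, std::size_t len,
                        const unsigned char* prefix, std::size_t prefix_len) {
    return len >= prefix_len && std::memcmp(data, prefix, prefix_len) == 0;
}

EncodingGuess guess_encoding(const unsigned char* data, std::size_t len) {
    static const unsigned char bom_utf32_le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char bom_utf32_be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char bom_utf8[]     = {0xEF, 0xBB, 0xBF};
    static const unsigned char bom_utf16_le[] = {0xFF, 0xFE};
    static const unsigned char bom_utf16_be[] = {0xFE, 0xFF};

    // Step 1: byte order marks. The UTF-32 LE mark is tested before the
    // UTF-16 LE mark because FF FE 00 00 also begins with FF FE. (That
    // sequence could in principle be UTF-16 LE followed by U+0000; this
    // sketch simply prefers UTF-32.)
    if (starts_with(data, len, bom_utf32_le, 4)) return GUESS_UTF32_LE_BOM;
    if (starts_with(data, len, bom_utf32_be, 4)) return GUESS_UTF32_BE_BOM;
    if (starts_with(data, len, bom_utf8, 3))     return GUESS_UTF8_BOM;
    if (starts_with(data, len, bom_utf16_le, 2)) return GUESS_UTF16_LE_BOM;
    if (starts_with(data, len, bom_utf16_be, 2)) return GUESS_UTF16_BE_BOM;

    // Steps 2 and 3: one pass over the buffer, looking for zero bytes
    // before the last byte and for bytes with the top bit set.
    bool has_zero = false;
    bool has_zero_run = false;
    bool has_high_bit = false;
    for (std::size_t i = 0; i < len; ++i) {
        const bool is_last = (i + 1 == len);
        if (data[i] == 0 && !is_last) {
            has_zero = true;
            // A second zero right after this one, also before the last byte.
            if (i + 2 < len && data[i + 1] == 0) has_zero_run = true;
        }
        if (data[i] >= 0x80) has_high_bit = true;
    }

    if (has_zero_run) return GUESS_UTF32;
    if (has_zero)     return GUESS_UTF16_OR_UTF32;
    // Assumption: the input is restricted to Unicode encodings, so UTF-8 is
    // the likely candidate; otherwise it may be some other character set.
    if (has_high_bit) return GUESS_UTF8_OR_MULTIBYTE;

    // Step 4: only 7-bit bytes and no embedded zeros.
    return GUESS_SEVEN_BIT;
}
```

A result such as GUESS_UTF8_OR_MULTIBYTE is still only a guess. Validating that the bytes form well-formed UTF-8 sequences would be a natural next check, and short inputs can be ambiguous: the four bytes 41 00 00 00, for example, are treated here as UTF-32 although they could also be UTF-16 text followed by a terminator.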