Detect encoding of a string in C/C++

Question

Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I searched, but most of the samples I found are written in C#.

Thanks

Recommended answer

Assuming you know the length of the input array, you can make the following guesses:

  1. First, check whether the first few bytes match any of the well-known byte order marks (BOMs) for Unicode. If they do, you're done!
  2. Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
  3. If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
  4. At this point it is one of: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen not to use the top bit and do not contain any null characters.
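
The four steps above can be sketched roughly as follows. This is only a heuristic illustration, not a definitive detector: the enum `EncodingGuess`, its labels, and the functions `starts_with` and `guess_encoding` are names made up for this example, and the code avoids C++11 features so that it should also build with an older compiler such as Visual Studio 2008.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result labels for this sketch; they do not come from any library.
enum EncodingGuess {
    GUESS_UTF8_BOM,
    GUESS_UTF16_LE_BOM,
    GUESS_UTF16_BE_BOM,
    GUESS_UTF32_LE_BOM,
    GUESS_UTF32_BE_BOM,
    GUESS_UTF32_NO_BOM,            // several consecutive zero bytes
    GUESS_UTF16_OR_UTF32_NO_BOM,   // at least one zero byte before the last byte
    GUESS_UTF8_OR_OTHER_MULTIBYTE, // some byte in 0x80..0xff
    GUESS_ASCII_COMPATIBLE         // ASCII, UTF-7, Base64, ...
};

static const unsigned char BOM_UTF32_BE[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char BOM_UTF32_LE[] = { 0xFF, 0xFE, 0x00, 0x00 };
static const unsigned char BOM_UTF8[]     = { 0xEF, 0xBB, 0xBF };
static const unsigned char BOM_UTF16_BE[] = { 0xFE, 0xFF };
static const unsigned char BOM_UTF16_LE[] = { 0xFF, 0xFE };

// True if the buffer begins with the given byte sequence.
static bool starts_with(const unsigned char* data, std::size_t len,
                        const unsigned char* prefix, std::size_t prefix_len)
{
    return len >= prefix_len && std::memcmp(data, prefix, prefix_len) == 0;
}

EncodingGuess guess_encoding(const unsigned char* data, std::size_t len)
{
    // Step 1: byte order marks. The UTF-32 LE mark (FF FE 00 00) begins with
    // the UTF-16 LE mark (FF FE), so the 4-byte marks are tested first.
    if (starts_with(data, len, BOM_UTF32_BE, 4)) return GUESS_UTF32_BE_BOM;
    if (starts_with(data, len, BOM_UTF32_LE, 4)) return GUESS_UTF32_LE_BOM;
    if (starts_with(data, len, BOM_UTF8, 3))     return GUESS_UTF8_BOM;
    if (starts_with(data, len, BOM_UTF16_BE, 2)) return GUESS_UTF16_BE_BOM;
    if (starts_with(data, len, BOM_UTF16_LE, 2)) return GUESS_UTF16_LE_BOM;

    // Steps 2 and 3: one pass that records zero bytes before the last byte,
    // runs of consecutive zero bytes, and bytes with the top bit set.
    bool has_null = false;
    bool has_null_run = false;
    bool has_high_bit = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == 0 && i + 1 < len) {
            has_null = true;
            if (i > 0 && data[i - 1] == 0) has_null_run = true;
        }
        if (data[i] >= 0x80) has_high_bit = true;
    }
    if (has_null_run) return GUESS_UTF32_NO_BOM;
    if (has_null)     return GUESS_UTF16_OR_UTF32_NO_BOM;
    if (has_high_bit) return GUESS_UTF8_OR_OTHER_MULTIBYTE;

    // Step 4: only 7-bit bytes and no embedded zero bytes.
    return GUESS_ASCII_COMPATIBLE;
}
```

A caller would pass the buffer and its length, for example `guess_encoding(reinterpret_cast<const unsigned char*>(ptr), length)`. The result is only a guess: a buffer that starts with FF FE 00 00 could also be UTF-16 LE text whose first character is a null, and `GUESS_UTF8_OR_OTHER_MULTIBYTE` still requires validating the bytes as UTF-8 or probing other character sets, which this sketch does not attempt.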
