Detect encoding of a string in C/C++

Question

Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I did a search, but most of the samples are in C#.

Thanks

Recommended answer

Assuming you know the length of the input array, you can make the following guesses:


  1. First, check whether the first few bytes match any of the well-known byte order marks (BOMs) for Unicode. If they do, you're done!
  2. Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0' bytes, it's probably UTF-32.
  3. If any byte is in the range 0x80 to 0xFF, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
  4. At this point it is either ASCII, UTF-7, Base64, or a stretch of UTF-16 or UTF-32 that just happens to use neither the top bit nor any null bytes. (For well-formed UTF-32 that is all but impossible, since the high byte of every code unit is zero.)
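
The four steps above can be sketched in C++ roughly as follows. This is only an illustration, not a complete detector: the enum and function names (`Guess`, `guess_encoding`, `is_valid_utf8`, `starts_with`) are made up for this example, and the UTF-8 well-formedness check in step 3 is an extra assumption on top of the answer, which simply assumes UTF-8 when the input is known to be Unicode.

```cpp
#include <cstddef>
#include <cstring>

// All names below are hypothetical and exist only for this sketch.
enum Guess {
    GUESS_UTF8_BOM, GUESS_UTF16_LE, GUESS_UTF16_BE,
    GUESS_UTF32_LE, GUESS_UTF32_BE,
    GUESS_UTF16_NO_BOM, GUESS_UTF32_NO_BOM,
    GUESS_UTF8, GUESS_OTHER_8BIT, GUESS_7BIT
};

static const unsigned char BOM_UTF8[]     = {0xEF, 0xBB, 0xBF};
static const unsigned char BOM_UTF16_LE[] = {0xFF, 0xFE};
static const unsigned char BOM_UTF16_BE[] = {0xFE, 0xFF};
static const unsigned char BOM_UTF32_LE[] = {0xFF, 0xFE, 0x00, 0x00};
static const unsigned char BOM_UTF32_BE[] = {0x00, 0x00, 0xFE, 0xFF};

static bool starts_with(const unsigned char* p, std::size_t n,
                        const unsigned char* bom, std::size_t bom_len) {
    return n >= bom_len && std::memcmp(p, bom, bom_len) == 0;
}

// Added assumption for step 3: accept the input as UTF-8 only if every
// sequence is well formed (no overlong forms, surrogates, or truncation).
static bool is_valid_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                       // stray continuation or invalid lead byte
        if (extra > n - i - 1) return false;     // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        if (c == 0xE0 && p[i + 1] < 0xA0) return false;  // overlong 3-byte form
        if (c == 0xED && p[i + 1] > 0x9F) return false;  // UTF-16 surrogate
        if (c == 0xF0 && p[i + 1] < 0x90) return false;  // overlong 4-byte form
        if (c == 0xF4 && p[i + 1] > 0x8F) return false;  // above U+10FFFF
        i += extra + 1;
    }
    return true;
}

Guess guess_encoding(const unsigned char* p, std::size_t n) {
    // Step 1: byte order marks. The UTF-32 LE mark starts with the
    // UTF-16 LE mark, so the 4-byte marks are tested first.
    if (starts_with(p, n, BOM_UTF32_LE, 4)) return GUESS_UTF32_LE;
    if (starts_with(p, n, BOM_UTF32_BE, 4)) return GUESS_UTF32_BE;
    if (starts_with(p, n, BOM_UTF8, 3))     return GUESS_UTF8_BOM;
    if (starts_with(p, n, BOM_UTF16_LE, 2)) return GUESS_UTF16_LE;
    if (starts_with(p, n, BOM_UTF16_BE, 2)) return GUESS_UTF16_BE;

    // Steps 2 and 3: one pass looking for NUL bytes (ignoring the last
    // byte, which may be a terminator) and for bytes with the top bit set.
    bool nul = false, nul_run = false, high = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= 0x80) high = true;
        if (p[i] == 0 && i + 1 < n) {
            nul = true;
            if (i > 0 && p[i - 1] == 0) nul_run = true;  // consecutive NULs
        }
    }
    if (nul)  return nul_run ? GUESS_UTF32_NO_BOM : GUESS_UTF16_NO_BOM;
    if (high) return is_valid_utf8(p, n) ? GUESS_UTF8 : GUESS_OTHER_8BIT;

    // Step 4: 7-bit data with no NULs (ASCII, UTF-7, Base64, ...).
    return GUESS_7BIT;
}
```

A caller would pass the buffer and its length, for example `guess_encoding(reinterpret_cast<const unsigned char*>(buf), len)`, and treat the result as a guess rather than a guarantee. `GUESS_OTHER_8BIT` only says "not well-formed UTF-8"; telling legacy multi-byte character sets apart needs statistical detection that this sketch does not attempt.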
