Is there a way to check whether a string contains Unicode characters in C++?


Question

I have a string, and I need to check whether it contains Unicode (UTF-8 or UTF-16) characters. If it does, I need to convert them to ASCII. I have some idea about the conversion logic, but I need help detecting the Unicode characters in the string.

Recommended answer

There's no 100% guaranteed solution. I'd start by reading the first 100 or so bytes and trying to determine the encoding:

  • If the file starts with the three-byte sequence 0xEF, 0xBB, 0xBF, it's probably UTF-8. In this case, drop these three bytes and process the rest as UTF-8, as described below.

  • If the file starts with the two-byte sequence 0xFE, 0xFF, it's probably UTF-16BE. Drop these two bytes and process the rest as UTF-16BE, as described below.

  • If the file starts with the two-byte sequence 0xFF, 0xFE, it's probably UTF-16LE. Drop these two bytes and process the rest as UTF-16LE, as described below.

  • If every other byte, starting with the first, is mostly 0, then the file is probably UTF-16BE. (How much counts as "mostly" depends on the source of the data; in some cases even a handful of zero bytes could be sufficient.) Process it as UTF-16BE, as described below.

  • If every other byte, starting with the second, is mostly 0, then it's probably UTF-16LE (very frequent in the Windows world). Process it as UTF-16LE, as described below.

  • Otherwise, it's anyone's guess, but processing it as if it were UTF-8 (without dropping any bytes) is probably acceptable.
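The detection steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names Encoding, Detection and detectEncoding are hypothetical, the 100-byte sample size follows the suggestion above, and "mostly 0" is assumed here to mean "more than half of the sampled bytes at that position".

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical names, introduced only for this sketch.
enum class Encoding { Utf8, Utf16BE, Utf16LE };

struct Detection {
    Encoding encoding;    // best guess, never a guarantee
    std::size_t bomSize;  // leading BOM bytes to drop (0 if none)
};

Detection detectEncoding(const std::string& data, std::size_t sampleSize = 100)
{
    // char may be signed, so always compare as unsigned char.
    auto byteAt = [&](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    // Step 1: look for a byte order mark.
    if (data.size() >= 3 && byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return {Encoding::Utf8, 3};
    if (data.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return {Encoding::Utf16BE, 2};
    if (data.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return {Encoding::Utf16LE, 2};

    // Step 2: count zero bytes at even and odd offsets in the sample.
    std::size_t n = std::min(data.size(), sampleSize);
    n -= n % 2;  // only look at whole two-byte units
    std::size_t evenZeros = 0, oddZeros = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        if (byteAt(i) == 0) ++evenZeros;
        if (byteAt(i + 1) == 0) ++oddZeros;
    }
    const std::size_t pairs = n / 2;

    // Assumption: "mostly" means more than half of the sampled units.
    if (pairs > 0 && evenZeros * 2 > pairs) return {Encoding::Utf16BE, 0};
    if (pairs > 0 && oddZeros * 2 > pairs) return {Encoding::Utf16LE, 0};

    // Step 3: anyone's guess; treat it as UTF-8 without dropping bytes.
    return {Encoding::Utf8, 0};
}
```

The threshold is the part most worth tuning: text that mixes ASCII with many non-Latin characters will contain fewer zero bytes, so a lower cutoff may suit some data sources better.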

As for how to process the file:

  • For UTF-8, just check that all of the remaining bytes are in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, the file is ASCII (as well as being UTF-8). This check is also valid for most single-byte encodings, e.g. all of the ISO-8859 encodings (which are still widespread).

  • For UTF-16BE, every other byte, starting at the first, should be 0, and the remaining bytes should be in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the second.

  • For UTF-16LE, every other byte, starting at the second, should be 0, and the remaining bytes should be in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the first.

In all cases, this processing starts after dropping any bytes (the byte order mark) removed in the first step.
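The three processing rules can be sketched as one function. This is a minimal sketch under stated assumptions: Encoding and toAscii are hypothetical names, bomSize is the number of leading bytes dropped during detection, and a UTF-16 input with an odd number of remaining bytes is treated here as not convertible (the text above does not cover that case).

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, introduced only for this sketch.
enum class Encoding { Utf8, Utf16BE, Utf16LE };

// Returns true and stores the ASCII text in `out` if every character is ASCII.
// Returns false (leaving `out` untouched) if the input cannot be converted.
bool toAscii(const std::string& data, Encoding enc, std::size_t bomSize, std::string& out)
{
    if (bomSize > data.size()) return false;
    std::string result;

    if (enc == Encoding::Utf8) {
        // Every remaining byte must be in [0,128).
        for (std::size_t i = bomSize; i < data.size(); ++i) {
            if (static_cast<unsigned char>(data[i]) >= 128) return false;
            result.push_back(data[i]);
        }
        out = result;
        return true;
    }

    // UTF-16: assumption of this sketch, a dangling odd byte is an error.
    if ((data.size() - bomSize) % 2 != 0) return false;

    // Offset of the high (must be zero) byte within each two-byte unit.
    const std::size_t hi = (enc == Encoding::Utf16BE) ? 0 : 1;
    const std::size_t lo = 1 - hi;
    for (std::size_t i = bomSize; i < data.size(); i += 2) {
        const unsigned char high = static_cast<unsigned char>(data[i + hi]);
        const unsigned char low = static_cast<unsigned char>(data[i + lo]);
        if (high != 0 || low >= 128) return false;
        result.push_back(static_cast<char>(low));
    }
    out = result;
    return true;
}
```

A false return answers the original question: the string contains at least one character outside ASCII, so a plain byte-for-byte conversion is not possible.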

Finally, you don't say what you are trying to do. There are encoding conventions which allow representing all Unicode characters in pure ASCII; if the ASCII you generate will be processed by code expecting one of these conventions, then you'll have to process the full Unicode (including surrogate pairs in UTF-16) and convert the Unicode to whatever encoding the target program expects. C++, for example, expects universal character names; the representation for é, for example, would be \u00E9. This means you'd also have to convert \ to \\. (As far as I know, this convention only applies to programming languages, like C, C++ and Java.)
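As an illustration of such a convention, here is a rough sketch that escapes UTF-16 text into pure ASCII using C++-style universal character names. The function name escapeToAscii is hypothetical, and the input is assumed to be already decoded into UTF-16 code units in a std::u16string. Unpaired surrogates are simply emitted as \uXXXX here, which a C++ compiler would reject, so real code would need its own policy for them.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: escape UTF-16 text as ASCII with universal character names.
std::string escapeToAscii(const std::u16string& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];

        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        }

        if (cp == U'\\') {
            out += "\\\\";  // a literal backslash must itself be escaped
        } else if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));  // plain ASCII passes through
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating null
            if (cp <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

With this sketch, the text café comes out as caf\u00E9, and a single backslash in the input comes out as two.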
