C++: How to inspect a file's Byte Order Mark to tell whether it is UTF-8?

Question

How can I inspect a file's Byte Order Mark in C++ to tell whether the file is UTF-8?

Recommended answer

In general, you can't.

The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file and the first four bytes you read are:

0x00, 0x00, 0xfe, 0xff                      -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00                      -- The file is almost certainly UTF-32LE
0xfe, 0xff, XX,   XX                        -- The file is almost certainly UTF-16BE
0xff, 0xfe, XX,   XX   (but not 0x00, 0x00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf, XX                        -- The file is almost certainly UTF-8 with a BOM
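
The byte patterns listed above can be checked with a small function. What follows is only a minimal sketch: the names `Bom`, `detect_bom`, and `detect_bom_in_file` are made up for illustration and do not come from any library, and treating an unreadable file as "no BOM" is an assumption of this sketch. The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks because `FF FE 00 00` also begins with the UTF-16LE mark.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical type and function names, used for illustration only.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Classify the first n bytes of a buffer against the five BOM patterns.
Bom detect_bom(const unsigned char* b, std::size_t n) {
    // UTF-32 first: FF FE 00 00 would otherwise match the UTF-16LE mark.
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return Bom::Utf32LE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;  // no BOM: the encoding is still unknown
}

// Read up to four bytes from the start of a file and classify them.
Bom detect_bom_in_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return Bom::None;  // assumption: unreadable files count as "no BOM"
    unsigned char buf[4] = {};
    in.read(reinterpret_cast<char*>(buf), sizeof buf);
    const std::size_t n = static_cast<std::size_t>(in.gcount());
    return detect_bom(buf, n);
}
```

A caller would compare the result of `detect_bom_in_file(path)` against `Bom::Utf8`. A result of `Bom::None` says only that no mark was found, not that the file is not UTF-8.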

But what about anything else? If the first bytes match none of these five patterns, you cannot say for certain whether the file is or is not UTF-8.

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.

There are heuristics that try to infer, from the particular characters seen, whether a document is encoded in, say, ISO-8859-1, UTF-8, or CP1252. In general, though, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
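
One simple heuristic of this kind, offered here as an illustration rather than as the method the answer has in mind, is to check whether the entire content is well-formed UTF-8. The function name `is_valid_utf8` is hypothetical. Passing the check is evidence, not proof: pure ASCII passes trivially, and a file in another encoding can happen to be well-formed UTF-8. Failing it, however, does show the bytes are not UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: true if the n bytes at s form well-formed UTF-8.
bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = s[i];
        std::size_t len = 0;   // total bytes in this sequence
        unsigned long cp = 0;  // decoded code point
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;     // stray continuation byte or invalid lead byte
        if (len > n - i) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Text in ISO-8859-1 or CP1252 that contains accented letters usually fails this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence, which is what makes the test useful for telling those encodings apart from UTF-8.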
