Detect UTF-8 encoding (How does MS IDE do it)?

Question

A problem with various character encodings is that the containing file is not always clearly marked. There are inconsistent conventions for marking some using "byte-order-markers" or BOMs. But in essence you have to be told what the file encoding is, to read it accurately.

We build programming tools that read source files, and this gives us grief. We have means to specify defaults, sniff for BOMs, etc., and we do pretty well with conventions and defaults. But the place where we (and I assume everybody else) get hung up is UTF-8 files that are not BOM-marked.

Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine whether it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic, although general discussion of heuristics is fine. How can it be "right"? (Consider an ISO8859-x encoded string interpreted this way.)

This paper on detecting character encodings/sets is pretty interesting: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

EDIT December 2012: We ended up scanning the entire file to see if it contains any violations of UTF-8 sequences... and if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)
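
As an illustration of the whole-file scan described in the edit, here is a minimal sketch in C++. It is not the asker's code and not what any MS IDE is known to do; the function name `looksLikeUtf8` and its signature are made up for this example. It assumes the file has already been read into memory, and it applies the well-formedness rules from RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): returns true when
// buf[0..len) contains no violation of UTF-8 well-formedness.
// Pure 7-bit ASCII input also returns true.
bool looksLikeUtf8(const unsigned char *buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        if (b < 0x80) { ++i; continue; }      // ASCII byte, nothing to check

        std::size_t need;     // continuation bytes expected after the lead byte
        std::uint32_t cp;     // code point being assembled
        std::uint32_t minCp;  // smallest code point allowed for this length
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; minCp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; minCp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; minCp = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (len - i - 1 < need) return false; // sequence truncated at end of buffer

        for (std::size_t k = 1; k <= need; ++k) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += need + 1;
    }
    return true;
}
```

This also shows why the heuristic cannot be strictly "right": the ISO 8859-1 bytes 0xC3 0xA9 ("Ã©") are also the valid UTF-8 encoding of "é", so such a file passes the scan. Typical ISO 8859-x text, where an accented letter is followed by an ASCII character, fails at the first such letter.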

Recommended answer

If encoding is UTF-8, the first character you see over 0x7F must be the start of a UTF-8 sequence. So test it for that. Here is the code we use for that:

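typedef unsigned char unc; // assumed: "unc" is shorthand for unsigned char

// Returns the length in bytes (2 to 4) of the multi-byte UTF-8 sequence that
// starts at cpt, or 0 if cpt does not point at the start of such a sequence.
// Shown here as a free function; the original was a class member.
unc IsUTF8(const unc *cpt)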
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80)
         && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte over 0x7F.
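
The skip-and-continue loop described above might look like the following sketch. The driver name `IsAllUTF8` is hypothetical, `unc` is assumed to be `unsigned char`, and the buffer is assumed to be NUL-terminated (which is what keeps the look-ahead reads in bounds). The sequence check is restated as a free function with the same bit tests as the answer so that the example is self-contained.

```cpp
typedef unsigned char unc; // assumption: unc means unsigned char

// Same bit tests as the answer's IsUTF8, restated as a free function.
// Returns the sequence length (2 to 4), or 0 if no valid sequence starts here.
static unc IsUTF8(const unc *cpt)
{
    if (!cpt)
        return 0;
    if ((*cpt & 0xF8) == 0xF0) {        // start of 4-byte sequence
        if (((cpt[1] & 0xC0) == 0x80)
         && ((cpt[2] & 0xC0) == 0x80)
         && ((cpt[3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) {   // start of 3-byte sequence
        if (((cpt[1] & 0xC0) == 0x80)
         && ((cpt[2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) {   // start of 2-byte sequence
        if ((cpt[1] & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

// Hypothetical driver: walks a NUL-terminated buffer, stepping over ASCII
// bytes one at a time and over each multi-byte sequence as a unit.
bool IsAllUTF8(const unc *text)
{
    if (!text)
        return false;
    while (*text) {
        if (*text < 0x80) {   // plain ASCII byte
            ++text;
            continue;
        }
        unc n = IsUTF8(text); // byte over 0x7F must start a sequence
        if (n == 0)
            return false;     // not valid UTF-8
        text += n;            // skip the whole sequence
    }
    return true;
}
```

Because the `&&` chain short-circuits, a lead byte sitting right before the terminating NUL fails on the NUL and nothing past the end is read. Note that these bit tests accept some sequences a strict validator would reject, such as overlong forms starting with 0xC0 or 0xC1 and encoded surrogates.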
