How to detect the character encoding of a text file?

Question

I am trying to detect which character encoding is used in my file.

I tried the following code to get the standard encoding:

// Requires: using System.IO; using System.Text;
public static Encoding GetFileEncoding(string srcFile)
    {
      // *** Use Default of Encoding.Default (Ansi CodePage)
      Encoding enc = Encoding.Default;

      // *** Detect byte order mark if any - otherwise assume default
      byte[] buffer = new byte[5];
      using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
      {
        // Short files leave the remaining buffer bytes at zero
        file.Read(buffer, 0, 5);
      }

      if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
      else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        // 12001 utf-32BE Unicode (UTF-32 Big-Endian)
        enc = Encoding.GetEncoding(12001);
      else if (buffer[0] == 0xff && buffer[1] == 0xfe && buffer[2] == 0 && buffer[3] == 0)
        // UTF-32 little-endian; tested before UTF-16LE because both start with FF FE
        enc = Encoding.UTF32;
      else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
      else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);
      else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);

      return enc;
    }

The first five bytes of my file are 60, 118, 56, 46 and 49.

Is there a chart that shows which encoding matches those first five bytes?

Recommended answer

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

The UTF-32 BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {0x|10} xx xx (for BE) or xx xx {0x|10} 00 (for LE), where {0x|10} stands for a byte in the range 00-10. If the data has a length that's a multiple of 4, and every unit follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
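
The pattern check described above could be sketched roughly as follows. This is only an illustration: the `Utf32Heuristic` class, the `Utf32Kind` enum and the `Detect` method are hypothetical names, and the sketch assumes the whole file is already in memory as a byte array.

```csharp
using System;

// Hypothetical result type for this sketch
public enum Utf32Kind { None, BigEndian, LittleEndian }

public static class Utf32Heuristic
{
    // Checks every 4-byte unit against the BOM-less UTF-32 patterns.
    public static Utf32Kind Detect(byte[] data)
    {
        // The length must be a non-zero multiple of 4
        if (data == null || data.Length == 0 || data.Length % 4 != 0)
            return Utf32Kind.None;

        bool couldBeBE = true, couldBeLE = true;
        for (int i = 0; i < data.Length && (couldBeBE || couldBeLE); i += 4)
        {
            // Big-endian unit: 00 {00..10} xx xx
            if (data[i] != 0x00 || data[i + 1] > 0x10) couldBeBE = false;
            // Little-endian unit: xx xx {00..10} 00
            if (data[i + 3] != 0x00 || data[i + 2] > 0x10) couldBeLE = false;
        }

        // Data consisting only of zero bytes fits both patterns; this sketch reports BE
        if (couldBeBE) return Utf32Kind.BigEndian;
        if (couldBeLE) return Utf32Kind.LittleEndian;
        return Utf32Kind.None;
    }
}
```

For example, `Utf32Heuristic.Detect(System.Text.Encoding.UTF32.GetBytes("Hi"))` yields `Utf32Kind.LittleEndian`, because `Encoding.UTF32` is little-endian and `GetBytes` does not emit a BOM.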

ASCII has no BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
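
A minimal sketch of that ASCII test, assuming the file content has been read into a byte array (`AsciiCheck` and `IsAscii` are hypothetical names):

```csharp
using System;

public static class AsciiCheck
{
    // True when no byte falls in the 80-FF range
    public static bool IsAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }
}
```

The five bytes from the question (60, 118, 56, 46, 49, i.e. the characters `<v8.1`) are all below 0x80, so `IsAscii` returns true for them. That only shows the start of the file is consistent with ASCII; the rest of the file would have to be checked as well.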

The UTF-8 BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.
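
One way to run such a validation in .NET is to decode with a `UTF8Encoding` configured to throw on invalid bytes. The sketch below assumes the data fits in memory; `Utf8Check` and `IsValidUtf8` are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // throwOnInvalidBytes: true makes decoding fail on malformed sequences
    // instead of silently substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8, so it is worth running the ASCII test first if the two cases need to be told apart.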

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
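
The 2-byte figure can be reproduced by brute force: enumerate all 65,536 byte pairs, discard the 16,384 pairs that are pure ASCII, and count how many of the remaining 49,152 decode as strict UTF-8. The sketch below does this with .NET's strict decoder; `Utf8FalsePositiveCount` and `CountTwoByteSequences` are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8FalsePositiveCount
{
    // Counts non-ASCII 2-byte sequences and how many of them are valid UTF-8.
    public static (int Valid, int Total) CountTwoByteSequences()
    {
        var strict = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
        int valid = 0, total = 0;
        var pair = new byte[2];

        for (int a = 0; a < 256; a++)
        {
            for (int b = 0; b < 256; b++)
            {
                if (a < 0x80 && b < 0x80) continue; // pure ASCII is excluded

                total++;
                pair[0] = (byte)a;
                pair[1] = (byte)b;
                try
                {
                    strict.GetString(pair);
                    valid++;
                }
                catch (DecoderFallbackException)
                {
                    // Not valid UTF-8; not counted
                }
            }
        }
        return (valid, total);
    }
}
```

The only pairs that pass are a lead byte in C2-DF followed by a continuation byte in 80-BF, which gives 30 × 64 = 1920 valid sequences out of 49,152, or about 3.9%.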

The UTF-16 BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check for UTF-32 first.

There may be UTF-16 files without a BOM, but it would be really hard to detect them. The only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.
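
For completeness, a surrogate-pair scan might look like the sketch below. It is illustrative only: `Utf16SurrogateScan` and `HasSurrogatePair` are hypothetical names, the caller has to try both byte orders, and, as noted above, most real files contain no surrogate pairs at all, so a negative result proves nothing.

```csharp
using System;

public static class Utf16SurrogateScan
{
    // Looks for a high surrogate (D800-DBFF) immediately followed by
    // a low surrogate (DC00-DFFF), reading 16-bit units in the given byte order.
    public static bool HasSurrogatePair(byte[] data, bool bigEndian)
    {
        for (int i = 0; i + 3 < data.Length; i += 2)
        {
            int first = ReadUnit(data, i, bigEndian);
            int second = ReadUnit(data, i + 2, bigEndian);
            if (first >= 0xD800 && first <= 0xDBFF &&
                second >= 0xDC00 && second <= 0xDFFF)
            {
                return true;
            }
        }
        return false;
    }

    private static int ReadUnit(byte[] data, int offset, bool bigEndian)
    {
        return bigEndian
            ? (data[offset] << 8) | data[offset + 1]
            : (data[offset + 1] << 8) | data[offset];
    }
}
```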

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.
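
A rough sketch of that XML check, assuming the first few hundred bytes of the file are available as a byte array and the declaration is written in an ASCII-compatible encoding. `XmlDeclarationSniffer` and `TrySniffEncodingName` are hypothetical names, and the regular expression is a simplification of the XML grammar rather than a full parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    // Simplified match for: encoding="name" or encoding='name'
    private static readonly Regex EncodingAttr =
        new Regex("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns false when the data does not start with "<?xml".
    // Otherwise sets name to the declared encoding, or "UTF-8" if none is declared.
    public static bool TrySniffEncodingName(byte[] head, out string name)
    {
        name = "";
        byte[] magic = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C }; // "<?xml"
        if (head.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (head[i] != magic[i]) return false;
        }

        // Only the declaration itself is inspected, up to the closing "?>"
        string text = Encoding.ASCII.GetString(head);
        int end = text.IndexOf("?>", StringComparison.Ordinal);
        string declaration = end >= 0 ? text.Substring(0, end) : text;

        Match match = EncodingAttr.Match(declaration);
        name = match.Success ? match.Groups[1].Value : "UTF-8";
        return true;
    }
}
```

The returned name could then be passed to `Encoding.GetEncoding(name)`, which throws an `ArgumentException` if the name is not a supported encoding on the current platform.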

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.
