Detect the encoding of a text file using C#

Problem description

I have a set of Markdown files to be passed to a Jekyll project, and I need to find their encoding format, i.e. UTF-8 with BOM, UTF-8 without BOM, or ANSI, using a program or an API.

If I pass the location of the files, the files should be listed and read, and the encoding of each produced as the result.

Is there any code or API for this?

I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in "Effective way to find any file's Encoding", but the result differs from what Notepad++ reports.

I also tried https://github.com/errepi/ude (the Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by referencing ude.dll in the C# project, but the result does not match Notepad++: Notepad++ shows the file encoding as UTF-8, while the program reports UTF-8 with BOM.

Both approaches should give the same result, so where is the problem?

Answer

Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as a byte array, just use the GetPreamble() method of an Encoding object. This allows you to detect a whole range of encodings by their preamble.
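
As a minimal sketch, the preambles of the common Unicode encodings can be listed like this (the byte values shown are the standard BOMs):

```csharp
using System;
using System.Text;

class PreambleDemo
{
    static void Main()
    {
        // Each Encoding object reports its own BOM via GetPreamble().
        Console.WriteLine(BitConverter.ToString(new UTF8Encoding(true).GetPreamble()));           // EF-BB-BF
        Console.WriteLine(BitConverter.ToString(new UnicodeEncoding(false, true).GetPreamble())); // FF-FE (UTF-16 LE)
        Console.WriteLine(BitConverter.ToString(new UnicodeEncoding(true, true).GetPreamble()));  // FE-FF (UTF-16 BE)
        Console.WriteLine(BitConverter.ToString(new UTF32Encoding(false, true).GetPreamble()));   // FF-FE-00-00 (UTF-32 LE)
    }
}
```

Comparing each of these preambles against the start of the file bytes is all a BOM check amounts to.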

Now, as for detecting UTF-8 without a preamble, that's actually not very hard either. UTF-8 has strict bitwise rules about which byte values may appear in a valid sequence, and you can initialize a UTF8Encoding object so that it fails with an exception when those rules are violated.
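
A minimal sketch of that strict mode: with throwOnInvalidBytes set to true, decoding an invalid sequence throws a DecoderFallbackException (a subclass of ArgumentException), while valid UTF-8 decodes normally.

```csharp
using System;
using System.Text;

class StrictUtf8Demo
{
    static void Main()
    {
        // Second constructor argument: throw on invalid byte sequences.
        UTF8Encoding strict = new UTF8Encoding(false, true);

        byte[] valid = { 0xC3, 0xA9 };   // "é" encoded as UTF-8
        Console.WriteLine(strict.GetString(valid)); // prints: é

        byte[] invalid = { 0xC3, 0x28 }; // 0x28 is not a valid continuation byte
        try
        {
            strict.GetString(invalid);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("not valid UTF-8");
        }
    }
}
```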

So if you first do the BOM check, then the strict decoding check, and finally fall back to Windows-1252 (what you call "ANSI"), your detection is done.

using System;
using System.IO;
using System.Linq;
using System.Text;

Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;
// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // Seems that despite being an encoding with preamble,
        // it doesn't actually skip said preamble when decoding...
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// use boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
// fall back to default ANSI encoding.
if (encoding == null)
{
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}
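
To match the question's requirement (pass a folder, list the files, report each one's encoding), the snippet above can be wrapped into a helper. This is only a sketch; the names EncodingDetector and DetectEncoding, and the "*.md" pattern, are illustrative, not part of the original answer.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class EncodingDetector
{
    // Same order as above: BOM check -> strict UTF-8 check -> Windows-1252 fallback.
    public static Encoding DetectEncoding(byte[] bytes, out string text)
    {
        UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
        byte[] preamble = encUtf8Bom.GetPreamble();
        if (bytes.Length >= preamble.Length && preamble.SequenceEqual(bytes.Take(preamble.Length)))
        {
            try
            {
                text = encUtf8Bom.GetString(bytes, preamble.Length, bytes.Length - preamble.Length);
                return encUtf8Bom;
            }
            catch (ArgumentException) { /* BOM present but invalid UTF-8; fall through to ANSI */ }
        }
        else
        {
            UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
            try
            {
                text = encUtf8NoBom.GetString(bytes);
                return encUtf8NoBom;
            }
            catch (ArgumentException) { /* not UTF-8; fall through to ANSI */ }
        }
        // On .NET Core / .NET 5+, code page 1252 additionally requires the
        // System.Text.Encoding.CodePages package and a RegisterProvider call.
        Encoding ansi = Encoding.GetEncoding(1252);
        text = ansi.GetString(bytes);
        return ansi;
    }

    static void Main()
    {
        foreach (string file in Directory.EnumerateFiles(".", "*.md"))
        {
            string text;
            Encoding enc = DetectEncoding(File.ReadAllBytes(file), out text);
            bool hasBom = enc.GetPreamble().Length > 0;
            Console.WriteLine("{0}: {1}{2}", file, enc.WebName, hasBom ? " with BOM" : "");
        }
    }
}
```

Note that a UTF8Encoding constructed with or without BOM reports the same WebName ("utf-8"), so the BOM/no-BOM distinction is read back from GetPreamble().Length.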

Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, meaning everything in it produces a technically valid character, so unless you go for heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.
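
That property can be checked directly: every possible byte value decodes to some character, so decoding with Windows-1252 can never fail and carries no signal for detection. A sketch, with the caveat that on .NET Core / .NET 5+ the code-page encodings must be registered first via the System.Text.Encoding.CodePages package (on the classic .NET Framework the RegisterProvider call is unnecessary):

```csharp
using System;
using System.Text;

class AnsiDemo
{
    static void Main()
    {
        // .NET Core / .NET 5+ only: make the code-page encodings available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding ansi = Encoding.GetEncoding(1252);

        // Decode all 256 possible byte values; nothing throws.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte)i;
        string text = ansi.GetString(all);
        Console.WriteLine(text.Length); // 256 - one character per byte
    }
}
```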
