Determine TextFile Encoding?


Question


I need to determine if a text file's content is equal to one of these text encodings:

System.Text.Encoding.ASCII
System.Text.Encoding.BigEndianUnicode ' UTF-16 (big endian)
System.Text.Encoding.Default ' ANSI
System.Text.Encoding.Unicode ' UTF-16 (little endian)
System.Text.Encoding.UTF32
System.Text.Encoding.UTF7
System.Text.Encoding.UTF8

I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.

Solution

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once the data is loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.

The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't reflect the detected encoding until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function
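To see how this behaves in practice, here is a minimal sketch that writes a temporary file with a given encoding and then asks StreamReader what it detected. The module and function names (StreamReaderBomDemo, DetectWithStreamReader) are made up for this illustration and are not part of the original answer.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module StreamReaderBomDemo
    ' Hypothetical helper: writes text to a temp file using enc (File.WriteAllText
    ' emits the encoding's BOM, if it has one), then returns what StreamReader reports.
    Function DetectWithStreamReader(text As String, enc As Encoding) As Encoding
        Dim tempFile As String = Path.GetTempFileName()
        Try
            File.WriteAllText(tempFile, text, enc)
            Using sr As New StreamReader(tempFile, True)
                sr.Read() ' CurrentEncoding is only updated after the first read
                Return sr.CurrentEncoding
            End Using
        Finally
            File.Delete(tempFile)
        End Try
    End Function

    Sub Main()
        ' File written as UTF-16 little endian, with a BOM
        Console.WriteLine(DetectWithStreamReader("hello", Encoding.Unicode).WebName)
        ' File written as ISO-8859-1, which has no BOM
        Console.WriteLine(DetectWithStreamReader("café", Encoding.GetEncoding("iso-8859-1")).WebName)
    End Sub
End Module
```

The second call is the interesting one: the file has no BOM, so the reader has nothing to detect and simply keeps its UTF-8 default, even though the file was not written as UTF-8.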

However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader may only detect certain kinds of encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00
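The list above can be turned into a small lookup routine. The following is only a sketch: the module name BomTable and the function IdentifyBom are invented for this example. It tests the longest BOMs first, because the UTF-32 little endian BOM (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).

```vbnet
Imports System
Imports System.Linq

Module BomTable
    ' Display names and BOM bytes in matching order, longest BOMs first.
    Private ReadOnly Names() As String =
        {"UTF-32 BE", "UTF-32 LE", "UTF-8", "UTF-16 BE", "UTF-16 LE"}

    Private ReadOnly Boms As Byte()() = {
        New Byte() {&H0, &H0, &HFE, &HFF},
        New Byte() {&HFF, &HFE, &H0, &H0},
        New Byte() {&HEF, &HBB, &HBF},
        New Byte() {&HFE, &HFF},
        New Byte() {&HFF, &HFE}
    }

    ' Returns the name of the first BOM that data starts with,
    ' or Nothing when no listed BOM matches.
    Function IdentifyBom(data() As Byte) As String
        For i As Integer = 0 To Boms.Length - 1
            Dim bom() As Byte = Boms(i)
            If data.Length >= bom.Length AndAlso
               data.Take(bom.Length).SequenceEqual(bom) Then
                Return Names(i)
            End If
        Next
        Return Nothing
    End Function
End Module
```

One caveat of any such table: a UTF-16 little endian file whose first character after the BOM is U+0000 also starts with FF FE 00 00, so it would be reported as UTF-32 LE. That is rare in real text files, but it shows that even BOM matching is a best guess.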

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

If (data(0) = &HFF) And (data(1) = &HFE) Then
    ' Data starts with UTF-16 (little endian) BOM
End If

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte-array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) AndAlso (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function

Of course, the above function assumes that the data is at least two bytes in length and that the BOM is exactly two bytes. So, while it illustrates how to do it as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:

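Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    ' Zip stops at the shorter sequence, so check the length first; otherwise
    ' data shorter than the BOM (even empty data) would count as a match.
    Return data.Length >= bom.Length AndAlso
        data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function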

So, the problem then becomes, how do you get a list of all the encodings? Well, it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings, which returns an array of EncodingInfo objects describing all of the supported encodings. Therefore, you could create a method which loops through all of the encodings, gets the BOM of each one and compares it to the byte array until you find one that matches. For instance:

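Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    ' Try the longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
    ' the UTF-16 LE BOM (FF FE), so the shorter one must not be tested first.
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        OrderByDescending(Function(enc) enc.GetPreamble().Length).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function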

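Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    ' Encodings without a BOM never match, and data shorter than the BOM
    ' cannot contain it (Zip alone would silently ignore the missing bytes).
    If bom.Length <> 0 AndAlso data.Length >= bom.Length Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function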

Once you make a function like that, then you could detect the encoding of a file like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If
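Once an encoding has been identified, the bytes still have to be decoded. Encoding.GetString does not strip the BOM, so decoding the whole array leaves a U+FEFF character at the start of the string. The sketch below skips the BOM when it is present; the module and function names (DecodeAfterDetection, DecodeSkippingBom) are hypothetical helpers for this illustration.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module DecodeAfterDetection
    ' Decodes data with enc, skipping the encoding's BOM
    ' if (and only if) the data actually starts with it.
    Function DecodeSkippingBom(data() As Byte, enc As Encoding) As String
        Dim bom() As Byte = enc.GetPreamble()
        Dim offset As Integer = 0
        If bom.Length > 0 AndAlso data.Length >= bom.Length AndAlso
           data.Take(bom.Length).SequenceEqual(bom) Then
            offset = bom.Length
        End If
        Return enc.GetString(data, offset, data.Length - offset)
    End Function
End Module
```

For example, passing the bytes EF BB BF 68 69 together with Encoding.UTF8 yields the two-character string "hi" rather than a three-character string with a leading U+FEFF.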

However, the problem remains, how do you automatically detect the correct encoding when there is no BOM? Technically it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that if no BOM is present, then UTF-8 will suffice. However, if any of the files happen to use something else, without a BOM, then that won't work.
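One inexpensive check for files without a BOM, offered here as a sketch rather than a complete answer, is to test whether the bytes are valid UTF-8 at all. A UTF8Encoding created with throwOnInvalidBytes set to True raises a DecoderFallbackException on malformed input. The names Utf8Heuristic and LooksLikeUtf8 are invented for this example.

```vbnet
Imports System
Imports System.Text

Module Utf8Heuristic
    ' Returns True when data decodes as UTF-8 without errors.
    ' Pure ASCII data also passes, since ASCII is a subset of UTF-8.
    Function LooksLikeUtf8(data() As Byte) As Boolean
        ' First argument: do not emit a BOM; second: throw on invalid bytes.
        Dim strict As New UTF8Encoding(False, True)
        Try
            strict.GetCharCount(data)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function
End Module
```

If the check fails, the file is probably in some other encoding, such as an ANSI code page, and falling back to Encoding.Default is one possible choice. A passing result is still only a strong hint, not proof, since short byte sequences from other encodings can happen to be valid UTF-8.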

As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:

Even though this question was for C#, you may also find the answers to it useful.
