Determine Text File Encoding?


Problem description

I need to determine if a text file's content is equal to one of these text encodings:

System.Text.Encoding.ASCII
System.Text.Encoding.BigEndianUnicode ' UTF-16 (big endian)
System.Text.Encoding.Default ' ANSI
System.Text.Encoding.Unicode ' UTF-16 (little endian)
System.Text.Encoding.UTF32
System.Text.Encoding.UTF7
System.Text.Encoding.UTF8

I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.

Solution

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, detecting the encoding is fairly painless, since each Unicode encoding defines its own BOM.
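
Because a BOM sits in the first few bytes of the data, it is not strictly necessary to load the whole file just to look for one. The following is a minimal sketch, assuming a hypothetical helper named ReadLeadingBytes (not part of .NET), which reads at most the first four bytes of a file:

```vbnet
Imports System
Imports System.IO

Module BomBytes
    ' Hypothetical helper: returns up to "count" bytes from the start of the file.
    ' The returned array is shorter than "count" if the file itself is shorter.
    Public Function ReadLeadingBytes(filePath As String, Optional count As Integer = 4) As Byte()
        Dim buffer(count - 1) As Byte
        Dim total As Integer = 0
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            ' Stream.Read may return fewer bytes than requested, so loop until done.
            While total < count
                Dim n As Integer = fs.Read(buffer, total, count - total)
                If n = 0 Then Exit While ' End of file reached
                total += n
            End While
        End Using
        If total < count Then Array.Resize(buffer, total)
        Return buffer
    End Function
End Module
```

For small files File.ReadAllBytes is simpler; a helper like this is mainly worth considering when the files may be large and only the BOM is of interest.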

The easiest way to automatically detect the encoding by interpreting the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't reflect the detected encoding until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As StreamReader = New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function

However, the problem with this approach is that MSDN seems to imply that the StreamReader may only detect certain kinds of encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding without giving you any way to know that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

If data.Length >= 2 AndAlso data(0) = &HFF AndAlso data(1) = &HFE Then
    ' Data starts with UTF-16 (little endian) BOM
End If
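
Extending that single check to all five BOMs listed above could look like the following sketch. DescribeBom and StartsWith are hypothetical helper names, not .NET APIs. The four-byte BOMs are tested first because the UTF-32 little endian BOM (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE), so checking the shorter one first would misreport UTF-32 data as UTF-16.

```vbnet
Imports System

Module BomTable
    ' Hypothetical helper: names the BOM found at the start of data,
    ' or returns Nothing when none of the listed BOMs matches.
    Public Function DescribeBom(data() As Byte) As String
        ' Longer BOMs come first so that a shared prefix cannot shadow them.
        Dim names() As String = {"UTF-32 BE", "UTF-32 LE", "UTF-8", "UTF-16 BE", "UTF-16 LE"}
        Dim boms()() As Byte = {
            New Byte() {&H0, &H0, &HFE, &HFF},
            New Byte() {&HFF, &HFE, &H0, &H0},
            New Byte() {&HEF, &HBB, &HBF},
            New Byte() {&HFE, &HFF},
            New Byte() {&HFF, &HFE}
        }
        For k As Integer = 0 To boms.Length - 1
            If StartsWith(data, boms(k)) Then Return names(k)
        Next
        Return Nothing
    End Function

    ' True when data begins with every byte of prefix, in order.
    Private Function StartsWith(data() As Byte, prefix() As Byte) As Boolean
        If data Is Nothing OrElse data.Length < prefix.Length Then Return False
        For i As Integer = 0 To prefix.Length - 1
            If data(i) <> prefix(i) Then Return False
        Next
        Return True
    End Function
End Module
```

One ambiguity remains that no ordering can resolve: UTF-16 little endian text whose first character is U+0000 also starts with FF FE 00 00. That is rare in real text files, which is why preferring UTF-32 in that case is a common, though not guaranteed, choice.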

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to get the BOM for UTF-16 (little endian), you could just do this:

Dim bom() As Byte = Encoding.Unicode.GetPreamble()
If data.Length >= 2 AndAlso data(0) = bom(0) AndAlso data(1) = bom(1) Then
    ' Data starts with UTF-16 (little endian) BOM
End If

Also conveniently, the .NET Encoding class provides a shared method called GetEncodings, which returns an array of EncodingInfo objects, one for each supported encoding. Therefore, you could create a method which loops through all of them, gets the BOM of each encoding and compares it to the byte array until you find one that matches. For instance:

Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    Dim detectedEncoding As Encoding = Nothing
    For Each info As EncodingInfo In Encoding.GetEncodings()
        Dim currentEncoding As Encoding = info.GetEncoding()
        Dim preamble() As Byte = currentEncoding.GetPreamble()
        Dim match As Boolean = True
        If (preamble.Length > 0) And (preamble.Length <= data.Length) Then
            For i As Integer = 0 To preamble.Length - 1
                If preamble(i) <> data(i) Then
                    match = False
                    Exit For
                End If
            Next
        Else
            match = False
        End If
        If match Then
            detectedEncoding = currentEncoding
            Exit For
        End If
    Next
    Return detectedEncoding
End Function

Once you make a function like that, then you could detect the encoding like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If
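
One caveat with stopping at the first match: the UTF-16 little endian BOM (FF FE) is a prefix of the UTF-32 little endian BOM (FF FE 00 00), so depending on the order in which GetEncodings enumerates the encodings, UTF-32 data may be reported as UTF-16. A possible variant, sketched here under the hypothetical name DetectEncodingFromLongestBom, scans every encoding and keeps the one with the longest matching preamble:

```vbnet
Imports System
Imports System.Text

Module LongestBom
    ' Hypothetical variant of DetectEncodingFromBom: instead of returning the
    ' first match, it remembers the encoding whose preamble is the longest match.
    Public Function DetectEncodingFromLongestBom(data() As Byte) As Encoding
        Dim best As Encoding = Nothing
        Dim bestLength As Integer = 0
        For Each info As EncodingInfo In Encoding.GetEncodings()
            Dim candidate As Encoding = info.GetEncoding()
            Dim preamble() As Byte = candidate.GetPreamble()
            ' Only preambles longer than the current best are worth comparing.
            If preamble.Length > bestLength AndAlso preamble.Length <= data.Length Then
                Dim match As Boolean = True
                For i As Integer = 0 To preamble.Length - 1
                    If preamble(i) <> data(i) Then
                        match = False
                        Exit For
                    End If
                Next
                If match Then
                    best = candidate
                    bestLength = preamble.Length
                End If
            End If
        Next
        Return best
    End Function
End Module
```

It is called the same way as DetectEncodingFromBom and likewise returns Nothing when no preamble matches.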

However, the problem remains: how do you automatically detect the correct encoding when there is no BOM? Technically it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files you deal with are in English, it's probably safe to assume that if no BOM is present, UTF-8 will suffice, but if any file is using something else, that won't work.
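
Two cheap checks can narrow things down when no BOM is present. This is only a sketch of one possible heuristic, and IsPureAscii and IsValidUtf8 are hypothetical helper names. The first checks that every byte is below 128. The second decodes the data with a UTF8Encoding constructed with throwOnInvalidBytes set to True, so that malformed sequences raise a DecoderFallbackException instead of being silently replaced.

```vbnet
Imports System
Imports System.Text

Module NoBomGuess
    ' True when every byte is in the 7-bit ASCII range.
    Public Function IsPureAscii(data() As Byte) As Boolean
        For Each b As Byte In data
            If b > &H7F Then Return False
        Next
        Return True
    End Function

    ' True when the bytes form well-formed UTF-8.
    Public Function IsValidUtf8(data() As Byte) As Boolean
        ' Arguments: encoderShouldEmitUTF8Identifier:=False, throwOnInvalidBytes:=True
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetCharCount(data)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function
End Module
```

Pure ASCII data is also valid UTF-8 and decodes identically under most ANSI code pages, so it can be read safely either way. Data that contains bytes above 127 and still passes IsValidUtf8 is very likely UTF-8, because text in an ANSI code page rarely happens to form valid multi-byte sequences. Data that fails the check is in some other encoding, but these checks alone cannot say which one.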

As you point out, there are applications that still automatically detect the encoding anyway, even without a BOM, but they do it through heuristics (i.e. educated guessing), and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:

Even though this question was for C#, you may also find the answers to it useful.
