System.IO.File.ReadAllText not throwing exception for invalid encoding

Question

I have some UTF-8 text in a file, utf8.txt. The file contains some characters outside the ASCII range. I tried the following code:

var fname = "utf8.txt";
var enc = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var s = System.IO.File.ReadAllText(fname, enc);

I expected this code to throw an exception, since the file is not valid ISO-8859-1 text. Instead, it decodes the UTF-8 text into the right characters (the string looks correct in the debugger).

Is this a bug in .NET?

The file I originally tested with was UTF-8 with a BOM. If I remove the BOM, the behavior changes. It still does not throw an exception, but it produces an incorrect Unicode string (the string does not look correct in the debugger).

To produce my test file, run the following code:

var fname = "utf8.txt";
var utf8_bom_e_circumflex_bytes = new byte[] {0xEF, 0xBB, 0xBF, 0xC3, 0xAA};
System.IO.File.WriteAllBytes(fname, utf8_bom_e_circumflex_bytes);

I think I have a firm handle on what is going on (although I don't agree with part of .NET's behavior).

  • If the file starts with a UTF-8 BOM and the data is valid UTF-8, then ReadAllText completely ignores the encoding you passed in and (properly) decodes the file as UTF-8. (I have not tested what happens if the BOM is a lie and the file is not really UTF-8.) I don't agree with this behavior. I think .NET should either throw an exception or use the encoding I gave it.

  • If the file has no BOM, .NET has no trivial (and 100% reliable) way to determine that the text is not really ISO-8859-1, since most (all?) UTF-8 text is also valid ISO-8859-1, although it decodes to gibberish. So it just follows your instructions and decodes the file with the encoding you gave it. (I do agree with this behavior.)

Recommended answer

"should throw an exception, since it is not valid ISO-8859-1 text"

In ISO-8859-1 every possible byte value maps to a character, so reading a non-ISO-8859-1 file as ISO-8859-1 will never result in an exception.

(True, all the bytes in the range 0x80–0x9F will become invisible control codes that you never want, but they're still valid, just useless. This is true of quite a few of the ISO-8859 encodings, which put the C1 control codes in the range 0x80–0x9F, but not all. You can certainly get an exception with other encodings that leave bytes unmapped, e.g. Windows-1252.)
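The point about ISO-8859-1 can be checked directly. The following is a minimal sketch, not part of the original answer: the class and method names (`Latin1Demo`, `DecodeEveryByte`) are made up for illustration. It feeds every byte value from 0x00 to 0xFF through the same strict encoding the question constructs; the exception fallback never fires, and each byte comes back as the character with the same code point.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // Decodes every possible byte value (0x00..0xFF) with a strict
    // ISO-8859-1 decoder, built the same way as in the question.
    public static string DecodeEveryByte()
    {
        var strictLatin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        var allBytes = new byte[256];
        for (int i = 0; i < allBytes.Length; i++)
        {
            allBytes[i] = (byte)i;
        }

        // No DecoderFallbackException is thrown here, because
        // ISO-8859-1 has no unmapped byte values.
        return strictLatin1.GetString(allBytes);
    }
}
```

The returned string has 256 characters, and the character at index i has code point i, including the C1 control codes in 0x80–0x9F.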

"If the file starts with UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8."

Yep. This is hinted at in the documentation:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks.
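The BOM detection can be observed, and switched off, with a small experiment. This is a sketch under stated assumptions rather than the original author's code: `BomDetectionDemo` and its members are hypothetical names, and the sample bytes are the ones from the question's test file. `StreamReader` has a constructor that takes a `detectEncodingFromByteOrderMarks` flag, so passing `false` makes the reader use the encoding it was given.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // The question's test bytes: a UTF-8 BOM followed by "ê" encoded as UTF-8.
    public static readonly byte[] SampleBytes = { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA };

    private static Encoding StrictLatin1()
    {
        return Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // BOM detection is on: the BOM wins over the requested encoding.
    public static string ReadWithDetection(string path)
    {
        return File.ReadAllText(path, StrictLatin1());
    }

    // BOM detection is off: the requested encoding is applied to every byte.
    public static string ReadWithoutDetection(string path)
    {
        using (var reader = new StreamReader(
            path, StrictLatin1(), detectEncodingFromByteOrderMarks: false))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a file containing `SampleBytes`, `ReadWithDetection` yields the single character "ê", while `ReadWithoutDetection` yields five characters, one per byte, with the BOM bytes decoded as "ï»¿".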

I agree with you that this behaviour is pretty stupid. I would prefer to read the file with ReadAllBytes and decode it manually with Encoding.GetString, checking the result myself.
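One way the ReadAllBytes approach might look is sketched below. `StrictReader` and `ReadAllTextStrict` are hypothetical names, not framework APIs. The helper clones the requested encoding so it can set an exception fallback, then decodes the raw bytes with `Encoding.GetString`, which does no BOM sniffing and does not strip a BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class StrictReader
{
    // Hypothetical helper: decodes the whole file with exactly the
    // encoding requested, throwing DecoderFallbackException on bytes
    // that are invalid in that encoding.
    public static string ReadAllTextStrict(string path, Encoding encoding)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // Shared instances such as Encoding.UTF8 are read-only;
        // a clone is writable, so its fallback can be replaced.
        var strict = (Encoding)encoding.Clone();
        strict.DecoderFallback = DecoderFallback.ExceptionFallback;

        // GetString applies the encoding to every byte, BOM included.
        return strict.GetString(bytes);
    }
}
```

With this helper, the question's test file read as ISO-8859-1 comes back as five Latin-1 characters, read as UTF-8 it comes back as U+FEFF followed by "ê", and a file holding a truncated UTF-8 sequence such as the lone byte 0xC3 throws when read as UTF-8. As the answer notes, ISO-8859-1 itself can never throw, so rejecting a UTF-8 file passed off as ISO-8859-1 would need an extra check, for example on the leading BOM bytes.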
