C# Issue with reading XML with chars of different encodings in it


Question

I ran into a problem reading an XML file. I found a workaround, but I still have some questions. The faulty XML file is encoded in UTF-8 and declares that encoding in its header, but it also contains one character, 'é', encoded in UTF-16. This code was used to read the XML file in order to validate its content:

var xDoc = XDocument.Load(taxFile);

It throws an exception for the faulty XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:

XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
    xDoc = XDocument.Load(oReader);
}

This code doesn't throw an exception for the faulty file, but the 'é' character is loaded as �. My first question is: why does it work?

Another point is that XmlReader doesn't throw an exception until the node containing 'é' is read.

XmlReader xmlTax = XmlReader.Create(filePath);

Again, the StreamReader workaround helps, and the same question applies. The fix doesn't seem good enough, because one day an XML file in another encoding may turn up and be processed incorrectly. However, I tried processing a UTF-16 encoded XML file and it worked fine (with the reader configured for UTF-8).

My final question is whether there are any options for XDocument/XmlReader to ignore character encoding problems, or something like that.

Looking forward to your replies. Thanks in advance.

Recommended answer

The first thing to note is that the XML file is in fact flawed: text encodings should not be mixed in the same file like this. The error is even more obvious because the file explicitly declares its encoding.

As for why StreamReader can read it without an exception: an Encoding instance carries settings that control what happens when incompatible data is encountered.

Encoding.UTF8 is documented to use replacement fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:


"The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character."
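To make the difference concrete, here is a minimal sketch that decodes the same invalid bytes twice: once with the shared Encoding.UTF8 instance and once with a UTF8Encoding created with throwOnInvalidBytes set to true. The names FallbackDemo and Sample are made up for illustration, and the lone 0xE9 byte is an assumption about what the stray 'é' might look like on disk. Although the quoted documentation mentions a question mark, the UTF-8 decoder substitutes U+FFFD, which matches the � seen in the question.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class FallbackDemo
{
    // "caf" followed by a lone 0xE9 byte, which is not valid UTF-8 on its own.
    public static readonly byte[] Sample = { 0x63, 0x61, 0x66, 0xE9 };

    // Lenient: the shared Encoding.UTF8 instance replaces undecodable bytes
    // instead of throwing.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict: throwOnInvalidBytes = true makes the decoder raise
    // DecoderFallbackException on undecodable bytes.
    public static bool StrictDecodeThrows(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

Calling FallbackDemo.DecodeLenient(FallbackDemo.Sample) should give "caf" followed by the replacement character, while FallbackDemo.StrictDecodeThrows(FallbackDemo.Sample) should report that the strict decoder threw.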

You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default. See http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
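As a rough illustration of instantiating the encoding yourself, the sketch below keeps the StreamReader approach from the question but passes a strict UTF8Encoding, so invalid bytes surface as an exception rather than being replaced silently. StrictXmlLoader and LoadStrict are hypothetical names, and the sketch assumes the input is meant to be UTF-8.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of the original question or answer.
public static class StrictXmlLoader
{
    // Loads an XML document from a stream, decoding it strictly as UTF-8.
    // Undecodable bytes cause an exception while the document is being read.
    public static XDocument LoadStrict(Stream input)
    {
        // Second argument (throwOnInvalidBytes) = true disables silent replacement.
        var strictUtf8 = new UTF8Encoding(false, true);
        using (var reader = new StreamReader(input, strictUtf8))
        {
            return XDocument.Load(reader);
        }
    }
}
```

A caller could wrap LoadStrict in a try/catch and fall back to the lenient reader only after logging the failure, so broken files are noticed instead of being loaded with � in place of the original characters.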

If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
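A custom fallback along those lines might look like the sketch below. Note the swapped-in technique: instead of interpreting UTF-16 byte sequences, it simply maps each undecodable byte to the ISO-8859-1 (Latin-1) character with the same value, which is only an assumption about how the stray 'é' was written. Latin1DecoderFallback, Latin1DecoderFallbackBuffer and TolerantUtf8 are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical fallback: maps each undecodable byte to the Latin-1 character
// with the same value (an assumption, not the UTF-16 logic from the answer).
public sealed class Latin1DecoderFallback : DecoderFallback
{
    // Upper bound on chars produced per fallback call; 4 is a cautious guess
    // for the longest invalid UTF-8 byte run handed to the buffer.
    public override int MaxCharCount { get { return 4; } }

    public override DecoderFallbackBuffer CreateFallbackBuffer()
    {
        return new Latin1DecoderFallbackBuffer();
    }
}

public sealed class Latin1DecoderFallbackBuffer : DecoderFallbackBuffer
{
    private char[] _chars = new char[0];
    private int _position;

    // Called by the decoder with the bytes it could not decode.
    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        _chars = new char[bytesUnknown.Length];
        for (int i = 0; i < bytesUnknown.Length; i++)
        {
            _chars[i] = (char)bytesUnknown[i]; // byte value == Latin-1 code point
        }
        _position = 0;
        return _chars.Length > 0;
    }

    // Returns the next substitute char, or NUL when none are left.
    public override char GetNextChar()
    {
        return _position < _chars.Length ? _chars[_position++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (_position > 0)
        {
            _position--;
            return true;
        }
        return false;
    }

    public override int Remaining { get { return _chars.Length - _position; } }

    public override void Reset()
    {
        _chars = new char[0];
        _position = 0;
    }
}

public static class TolerantUtf8
{
    // UTF-8 encoding whose decoder uses the Latin-1 fallback above.
    public static Encoding Create()
    {
        return Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            new Latin1DecoderFallback());
    }
}
```

The encoding returned by TolerantUtf8.Create() can be passed to the StreamReader constructor in place of Encoding.UTF8. This is a heuristic: Latin-1 bytes that happen to form a valid UTF-8 sequence are still decoded as UTF-8, so fixing the file at its source remains the better option.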
