XmlDocument.Load fails, LoadXml works:

Question

In answering this question, I came across a situation that I don't understand. The OP was trying to load XML from the following location: http://www.google.com/ig/api?weather=12414&hl=it

The obvious solution is:

using System.Xml;

string m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
XmlDocument myXmlDocument = new XmlDocument();
myXmlDocument.Load(m_strFilePath); // Load (takes a URL or path), NOT LoadXml (takes the XML text)

But this fails with:

XmlException : Invalid character in the given encoding. Line 1, position 499.

It seems to be choking on the à in Umidità.

OTOH, the following works fine:

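using System.Net;
using System.Xml;

var m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
string xmlStr;
using (var wc = new WebClient())
{
    // DownloadString decodes the response body into a string before any XML parsing
    xmlStr = wc.DownloadString(m_strFilePath);
}
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlStr); // parses text that has already been decoded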

I'm baffled by this. Can anyone explain why the former fails, but the latter works fine?

Notably, the XML declaration of the document omits an encoding.

Recommended answer

The WebClient uses the encoding information in the headers of the HTTP response to determine the correct encoding (in this case ISO-8859-1, a single-byte encoding that extends ASCII, so every character is exactly 8 bits).
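
As a rough illustration of the idea (not WebClient's actual internals), the charset can be read out of a Content-Type header value and mapped to an Encoding. The helper name FromContentType, the sample header string, and the UTF-8 fallback are all assumptions made for this sketch:

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetFromHeader
{
    // Hypothetical helper: returns the encoding named by the charset parameter
    // of a Content-Type header value. The UTF-8 fallback is an assumption of
    // this sketch, not a statement about what WebClient does.
    public static Encoding FromContentType(string contentType)
    {
        var parsed = new ContentType(contentType);
        return string.IsNullOrEmpty(parsed.CharSet)
            ? Encoding.UTF8
            : Encoding.GetEncoding(parsed.CharSet);
    }

    public static void Main()
    {
        // Assumed header value; the real response may be formatted differently.
        Encoding enc = FromContentType("text/xml; charset=ISO-8859-1");
        Console.WriteLine(enc.WebName);
    }
}
```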

It looks like XmlDocument.Load doesn't use this information, and as the encoding is also missing from the XML declaration, it has to guess at an encoding and gets it wrong. Some digging around leads me to believe that it chooses UTF-8.
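
If that diagnosis is right, one way around it is to decode the bytes yourself and hand XmlDocument a TextReader, so the parser never has to guess. The sketch below uses an in-memory byte array as a stand-in for the HTTP response. The sample XML, the class name, and the LoadAs helper are made up for illustration:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class LoadWithExplicitEncoding
{
    // Stand-in for the response body: XML with no encoding in its declaration,
    // encoded as ISO-8859-1 so that "à" (U+00E0) becomes the single byte 0xE0.
    public static byte[] SampleBytes()
    {
        const string xml = "<?xml version=\"1.0\"?><humidity data=\"Umidit\u00E0: 50%\"/>";
        return Encoding.GetEncoding("ISO-8859-1").GetBytes(xml);
    }

    // Decode with a known encoding first, then let XmlDocument parse the text.
    public static XmlDocument LoadAs(byte[] bytes, Encoding encoding)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            doc.Load(reader);
        }
        return doc;
    }

    public static void Main()
    {
        byte[] bytes = SampleBytes();

        try
        {
            // No encoding hint: the parser must detect the encoding from the bytes.
            new XmlDocument().Load(new MemoryStream(bytes));
            Console.WriteLine("Load(Stream) succeeded");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Load(Stream) failed: " + ex.Message);
        }

        XmlDocument ok = LoadAs(bytes, Encoding.GetEncoding("ISO-8859-1"));
        Console.WriteLine(ok.DocumentElement.GetAttribute("data"));
    }
}
```

For the real URL, the same pattern would wrap the response stream in a StreamReader built with the encoding from the response headers.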

If we want to get really technical, the character it chokes on is "à", which is 0xE0 in the ISO-8859-1 encoding, but that byte is not a valid character on its own in UTF-8. Specifically, the binary representation of this byte is:

11100000

If you dig around in the UTF-8 Wikipedia article, you can see that this bit pattern indicates a code point (i.e. character) made up of 3 bytes in the following format:

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
1110xxxx    10xxxxxx    10xxxxxx

But if we look back at the document, the next two characters are ": ", which are 0x3A and 0x20 in ISO-8859-1. This means that what we actually end up with is:

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
11100000    00111010    00100000

Neither the 2nd nor the 3rd byte of the sequence has 10 as its two most significant bits (which would indicate a continuation byte), so this sequence makes no sense in UTF-8.
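
The bit-level argument can be checked with a small sketch. The class and method names are invented for illustration. The strict UTF8Encoding is constructed with throwOnInvalidBytes set to true so that bad input raises an exception instead of being silently replaced:

```csharp
using System;
using System.Text;

public static class Utf8SequenceCheck
{
    // True when the byte has the form 1110xxxx, i.e. it opens a 3-byte sequence.
    public static bool IsThreeByteLead(byte b) => (b & 0xF0) == 0xE0;

    // True when the byte has the form 10xxxxxx, i.e. it is a continuation byte.
    public static bool IsContinuation(byte b) => (b & 0xC0) == 0x80;

    // Strict decode: reports invalid input instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try { strict.GetString(bytes); return true; }
        catch (DecoderFallbackException) { return false; }
    }

    public static void Main()
    {
        // "à: " as encoded in ISO-8859-1.
        byte[] bytes = { 0xE0, 0x3A, 0x20 };

        foreach (byte b in bytes)
        {
            Console.WriteLine(Convert.ToString(b, 2).PadLeft(8, '0')
                + "  lead=" + IsThreeByteLead(b)
                + "  continuation=" + IsContinuation(b));
        }

        Console.WriteLine("valid UTF-8: " + IsValidUtf8(bytes));
        Console.WriteLine("as ISO-8859-1: "
            + Encoding.GetEncoding("ISO-8859-1").GetString(bytes));
    }
}
```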
