XmlReader breaks on UTF-8 BOM
Question
I have the following XML parsing code in my application:
public static XElement Parse(string xml, string xsdFilename)
{
    var readerSettings = new XmlReaderSettings
    {
        ValidationType = ValidationType.Schema,
        Schemas = new XmlSchemaSet()
    };
    readerSettings.Schemas.Add(null, xsdFilename);
    readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
    readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
    readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
    readerSettings.ValidationEventHandler +=
        (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };
    var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);
    return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
}
I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.
It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I serialize a response object and send it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same parsing code to fail on the response:
System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.
From the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response XML parses fine.
Am I missing something obvious, or at least something insidious?
Here is the serialization code I'm using to return the response:
private static string SerializeResponse(Response response)
{
    var output = new MemoryStream();
    var writer = XmlWriter.Create(output);
    new XmlSerializer(typeof(Response)).Serialize(writer, response);
    var bytes = output.ToArray();
    var responseXml = Encoding.UTF8.GetString(bytes);
    return responseXml;
}
If it's just a matter of the XML incorrectly containing the BOM, then I'll switch to
var responseXml = new UTF8Encoding(false).GetString(bytes);
but it was not at all clear from my research that the BOM is illegal in the actual XML string; see e.g. "C# Detect XML encoding from Byte Array?"
Recommended answer
The XML string must not (!) contain the BOM. A BOM is only allowed in byte data (e.g. streams) that is encoded as UTF-8. This is because a string is not an encoded representation; it is already a sequence of Unicode characters.
It therefore seems that you are loading the string incorrectly, in code that you unfortunately didn't provide.
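To make the byte/string distinction concrete, here is a minimal sketch (the class name BomDemo and the helper StartsWithBom are made up for illustration). It decodes a byte array that begins with the UTF-8 BOM (EF BB BF) and checks whether the resulting string starts with the character U+FEFF. It also suggests that switching to new UTF8Encoding(false) would not help on the decoding side: that constructor argument controls whether a preamble is emitted when writing, while GetString does not strip a BOM that is already present in the bytes.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: true when the string begins with U+FEFF (the decoded BOM).
    public static bool StartsWithBom(string s)
    {
        return s.Length > 0 && s[0] == '\uFEFF';
    }

    public static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by the bytes of "<a/>".
        byte[] bytes = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };

        // GetString decodes every byte, including the BOM, into characters.
        string viaDefault = Encoding.UTF8.GetString(bytes);
        string viaNoBom = new UTF8Encoding(false).GetString(bytes);

        Console.WriteLine(StartsWithBom(viaDefault)); // True
        Console.WriteLine(StartsWithBom(viaNoBom));   // True
    }
}
```

In both cases the BOM survives as the first character of the string, which is what XmlReader then rejects at line 1, position 1.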
Thanks for posting the serialization code.
You should not write the data to a MemoryStream, but rather to a StringWriter, which you can then convert to a string with ToString. Since this avoids passing through a byte representation, it is not only faster but also avoids this kind of problem.
Something like this:
private static string SerializeResponse(Response response)
{
    using (var output = new StringWriter())
    {
        // Disposing the XmlWriter flushes any buffered output into the StringWriter.
        using (var writer = XmlWriter.Create(output))
        {
            new XmlSerializer(typeof(Response)).Serialize(writer, response);
        }
        return output.ToString();
    }
}
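As a rough way to compare the two approaches side by side, the sketch below runs both serialization paths on the same object and checks the first character of each result. The Response class here is a hypothetical stand-in with a single property, since the real type is not shown in the question, and the class and method names are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;

// Hypothetical stand-in for the Response type from the question.
public class Response
{
    public string Message { get; set; }
}

public static class SerializationPaths
{
    // Byte path from the question: serialize to a stream, then decode the bytes.
    public static string ViaMemoryStream(Response response)
    {
        using (var output = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(output))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }

    // Character path from the answer: no bytes involved, so no BOM can appear.
    public static string ViaStringWriter(Response response)
    {
        using (var output = new StringWriter())
        {
            using (var writer = XmlWriter.Create(output))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            return output.ToString();
        }
    }

    public static void Main()
    {
        var response = new Response { Message = "ok" };

        // Per the behavior reported in the question, the stream path is
        // expected to yield a string that starts with U+FEFF.
        Console.WriteLine(ViaMemoryStream(response)[0] == '\uFEFF');

        string xml = ViaStringWriter(response);
        Console.WriteLine(xml[0] == '\uFEFF');             // False
        Console.WriteLine(XElement.Parse(xml).Name.LocalName); // Response
    }
}
```

One thing to be aware of: because StringWriter reports UTF-16 as its encoding, the XML declaration produced this way says encoding="utf-16". That is harmless when the string is parsed again from a string, but it may matter if the text is later written out as UTF-8 bytes.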