XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter


Problem description

I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly; each one is replaced with "?" instead.

Here is sample code that demonstrates how the XML is written from the XmlDocument:

MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream,enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();

What can be done to correctly encode Kanji in this specific situation?

Recommended answer

As mentioned in the comments, the ? character shows up because Kanji characters are not supported by the ISO-8859-1 encoding, so the encoder substitutes ? as a fallback character. Encoding fallbacks are discussed in the Remarks section of the documentation for Encoding:

Note that the encoding classes allow errors (unsupported characters) to silently change to a "?" character.

This is the behavior you are seeing.
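The fallback behavior can be observed without any XML involved. The following is a minimal sketch, not part of the original answer; the class and method names (FallbackDemo, RoundTrip, CanEncode) are made up for illustration. It encodes a string to ISO-8859-1 with the default fallback, then repeats the attempt with EncoderFallback.ExceptionFallback so that an unsupported character raises an exception instead of being replaced silently.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes text to ISO-8859-1 and decodes it again using the default fallback.
    // Characters the encoding cannot represent come back as the fallback character.
    public static string RoundTrip(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = latin1.GetBytes(text);
        return latin1.GetString(bytes);
    }

    // Returns true only if every character can be represented in ISO-8859-1.
    // The exception fallback makes the encoder throw instead of substituting.
    public static bool CanEncode(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // \u7551 is the Kanji used in the sample XML of this answer.
        Console.WriteLine(RoundTrip("\u7551 hatake"));
        Console.WriteLine(CanEncode("\u7551"));
        Console.WriteLine(CanEncode("hatake"));
    }
}
```

A check like CanEncode can be useful when silent data loss is unacceptable and you would rather fail early than write "?" into the output.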

However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) (https://msdn.microsoft.com/en-us/library/ms162617.aspx) and setting your encoding on XmlWriterSettings.Encoding, like so:

MemoryStream mStream = new MemoryStream();

var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
    Encoding = enc,
    CloseOutput = false,
    // Remove to enable the XML declaration if you want it.  XmlTextWriter doesn't include it automatically.
    OmitXmlDeclaration = true,  
};
using (var writer = XmlWriter.Create(mStream, settings))
{
    doc.WriteTo(writer);
}

mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();

When you set the Encoding property of XmlWriterSettings, the XML writer knows whenever a character is not supported by the current encoding and automatically replaces it with an XML character reference rather than a hardcoded fallback character.

For example, say you have XML like the following:

<Root>
  <string>畑 はたけ hatake "field of crops"</string>
</Root>

Then your code will output the following, mapping every Japanese character to the single fallback character:

<Root><string>? ??? hatake "field of crops"</string></Root>

While the new version will output:

<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>

Notice that the Kanji characters have been replaced with numeric character references such as &#x7551;. All compliant XML parsers will recognize these references and reconstruct the characters, so no information is lost even though your preferred encoding does not support Kanji.
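To illustrate that claim, here is a small sketch (not from the original answer; the class and method names are hypothetical) that parses the escaped output with XmlDocument and compares the recovered text with the original string. The Japanese characters are written as \u escapes so the example does not depend on the source file's encoding.

```csharp
using System;
using System.Xml;

public static class CharacterReferenceDemo
{
    // Parses the XML and returns the text content of /Root/string.
    // The parser resolves numeric character references while loading.
    public static string ReadStringElement(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode("/Root/string").InnerText;
    }

    public static void Main()
    {
        // The escaped form that the XmlWriter-based code produces for the sample.
        string escaped =
            "<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake \"field of crops\"</string></Root>";

        // The original text, with the Japanese characters as \u escapes.
        string expected = "\u7551 \u306F\u305F\u3051 hatake \"field of crops\"";

        Console.WriteLine(ReadStringElement(escaped) == expected);
    }
}
```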

Finally, as an aside, note that the documentation for XmlTextWriter states:

Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.

So replacing it with an XmlWriter is a good idea in general.

Sample .NET fiddle demonstrating usage of both writers and asserting that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.
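The fiddle itself is not reproduced here. As a rough stand-in, the sketch below runs both writers over the sample document and compares each result with the original using XNode.DeepEquals. It is an illustration under assumptions, not the answer's actual fiddle: the class name, the helper methods, and the choice of DeepEquals as the equivalence check are all made up for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class WriterComparison
{
    // Sample document from the answer, with Japanese characters as \u escapes.
    const string SampleXml =
        "<Root><string>\u7551 \u306F\u305F\u3051 hatake \"field of crops\"</string></Root>";

    // Original approach: XmlTextWriter constructed directly over the stream.
    public static string WriteWithXmlTextWriter(XmlDocument doc, Encoding enc)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new XmlTextWriter(stream, enc);
            doc.WriteTo(writer);
            writer.Flush();
            return enc.GetString(stream.ToArray());
        }
    }

    // Suggested approach: XmlWriter.Create with the encoding set on the settings.
    public static string WriteWithXmlWriter(XmlDocument doc, Encoding enc)
    {
        using (var stream = new MemoryStream())
        {
            var settings = new XmlWriterSettings
            {
                Encoding = enc,
                CloseOutput = false,
                OmitXmlDeclaration = true,
            };
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.WriteTo(writer);
            }
            return enc.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        Encoding enc = Encoding.GetEncoding("ISO-8859-1");

        string oldOutput = WriteWithXmlTextWriter(doc, enc);
        string newOutput = WriteWithXmlWriter(doc, enc);
        Console.WriteLine(oldOutput);
        Console.WriteLine(newOutput);

        // Compare each output with the original document after re-parsing.
        XElement original = XElement.Parse(SampleXml);
        Console.WriteLine(XNode.DeepEquals(original, XElement.Parse(newOutput)));
        Console.WriteLine(XNode.DeepEquals(original, XElement.Parse(oldOutput)));
    }
}
```

Based on the behavior described above, the XmlWriter output should compare equal to the original after parsing, while the XmlTextWriter output should not, because its text has already been reduced to fallback characters.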
