C# junk characters break XElement "pretty" representation


Problem description


I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.

The following...

var badNode = XElement.Parse(@"<b>+
  <inner1/>
  <inner2/>
</b>");
Console.WriteLine(badNode);

prints out

<b>+
  <inner1 /><inner2 /></b>

while this...

var badNode = XElement.Parse(@"<b>
  <inner1/>
  <inner2/>
</b>");
Console.WriteLine(badNode);

gives the expected

<b>
  <inner1 />
  <inner2 />
</b>

According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.

Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?

Solution

You are getting awkward indentation for badNode because, by adding the non-whitespace + character into the <b> element value, the element now contains mixed content, which the W3C defines as follows:

3.2.2 Mixed Content

[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]

The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:

This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.

The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.

This explains the behavior you are seeing.
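To see the rule in isolation, here is a minimal sketch that drives XmlWriter directly instead of going through XElement.ToString(). The class and method names (MixedContentDemo, Write) are made up for illustration. It writes the same <b> element twice, once with a text node before the children and once without. In the question's input the text node is actually "+" followed by the line break and two spaces, which is why the output there breaks after the "+" and then stays on one line.

```csharp
using System;
using System.Text;
using System.Xml;

public static class MixedContentDemo
{
    // Writes <b> with two empty children. When withText is true, a text
    // node is written first, which turns <b> into a mixed-content element.
    public static string Write(bool withText)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true
        };

        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("b");
            if (withText)
            {
                // After this call the writer stops indenting inside <b>.
                writer.WriteString("+");
            }
            writer.WriteStartElement("inner1");
            writer.WriteEndElement();
            writer.WriteStartElement("inner2");
            writer.WriteEndElement();
            writer.WriteEndElement();
        }

        return sb.ToString();
    }

    public static void Main()
    {
        // Children are placed on their own indented lines.
        Console.WriteLine(Write(withText: false));
        // Everything inside <b> is emitted on a single line.
        Console.WriteLine(Write(withText: true));
    }
}
```

The only difference between the two calls is the single WriteString, so the change in layout comes from the mixed content and not from anything specific to XElement.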

As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want. The original line breaks and indentation are then kept as text nodes and written back verbatim:

var badNode = XElement.Parse(@"<b>+
  <inner1/>
  <inner2/>
</b>", LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);

Which outputs:

<b>+
  <inner1 />
  <inner2 />
</b>

Demo fiddle #1 here.

Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:

badNode.Nodes().OfType<XText>().Remove();

Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
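The one-liner above only touches the direct children of badNode. If stray characters can appear at any depth, a slightly more cautious variant might look like the sketch below. The helper name StripStrayText is hypothetical, and the rule it applies (drop text only from elements that also have child elements) is an assumption about what counts as junk in your data. Note that XCData derives from XText, so CDATA sections inside such elements would be removed as well.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class StripTextDemo
{
    // Removes text nodes from every element that also has child elements,
    // leaving text-only elements such as <inner2>keep</inner2> untouched.
    public static XElement StripStrayText(XElement root)
    {
        var stray = root.DescendantsAndSelf()
            .Where(e => e.HasElements)
            .SelectMany(e => e.Nodes().OfType<XText>())
            .ToList(); // snapshot before modifying the tree

        stray.Remove();
        return root;
    }

    public static void Main()
    {
        var badNode = XElement.Parse(@"<b>+
  <inner1/>
  <inner2>keep</inner2>
</b>");

        // <b> no longer has mixed content, so it is indented again.
        Console.WriteLine(StripStrayText(badNode));
    }
}
```

Restricting the removal to elements with child elements keeps ordinary text values intact, which a blanket DescendantNodes().OfType<XText>().Remove() would not.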

Demo fiddle #2 here.
