XML not well-formed because of a special character inside CDATA
I have this xml:
<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/" version="2.0">
<channel>
<wp:wxr_version>1.2</wp:wxr_version>
<item>
<title type="html">
<![CDATA[ <h1 class="title">"Title with special character"</h1> ]]>
</title>
<content:encoded type="html">
<![CDATA[ <div class="content clearfix">
<p>Content Example Text</p>
</div> ]]>
</content:encoded>
<wp:post_id>0</wp:post_id>
<wp:post_date>2000-09-30T10:22:00.001Z</wp:post_date>
</item>
</channel>
</rss>
Inside the HTML title tag there is the Unicode character U+0007.
Why is the XML invalid?
I'm using CDATA; isn't that supposed to make it valid?
What can I do to detect which characters are invalid and remove them before constructing the XML?
Let's be clear that we're talking about whether the XML is well-formed rather than invalid.
U+0007 is a control character (BEL), used in the past to cause a terminal to beep. It's not allowed in XML, even within CDATA. If it's in the data, then the data is not XML. Your options are to remove it or encode it so that it's not directly in the data (and so that recipients will understand how to decode it); one encoding option would be Base64 for the contents of any element that has to be able to represent such illegal characters.
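As a rough illustration of the "remove it or encode it" options above, here is a minimal Python sketch. The function names (find_illegal_xml_chars, strip_illegal_xml_chars, to_base64) are made up for this example, and the character class is written from the XML 1.0 Char production, so treat it as a starting point rather than a definitive filter.

```python
import base64
import re

# Anything outside the XML 1.0 "Char" production is illegal everywhere in a
# document, including inside CDATA sections:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 does not allow."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in _ILLEGAL_XML10.finditer(text)]


def strip_illegal_xml_chars(text):
    """Option 1: drop the illegal characters before building the XML."""
    return _ILLEGAL_XML10.sub("", text)


def to_base64(text):
    """Option 2: encode the whole value so no illegal character appears literally.

    The recipient has to know the element content is Base64-encoded UTF-8.
    """
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


title = "Title with special character\x07"
print(find_illegal_xml_chars(title))
print(repr(strip_illegal_xml_chars(title)))
print(to_base64(title))
```

Stripping loses information, while Base64 keeps it at the cost of readability, so which one fits depends on whether the BEL character carries any meaning for the consumer of the feed.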
See also
- XML Schema. Base64binary type vs String type
- Illegal character in XML feed?
- How to parse invalid (bad / not well-formed) XML?
XML 1.0 vs 1.1
Michael Kay helpfully commented that XML 1.1 allows additional characters, including U+0007 (as the character reference &#x7;), beyond those allowed in XML 1.0.
For example, consider the following document¹:
<?xml version="1.0" encoding="UTF-8" ?>
<r>
<e1></e1> <!-- e1 contains a literal U+0007 char -->
<e2>&#x7;</e2> <!-- &#x7; becomes a U+0007 char -->
<e3><![CDATA[]]></e3> <!-- e3 CDATA contains a literal U+0007 char -->
<e4><![CDATA[&#x7;]]></e4> <!-- &#x7; remains an uninterpreted string -->
</r>
With an XML 1.0 version setting in the XML declaration:
- U+0007 characters within e1, e2, and e3 prevent the XML from being well-formed.
With an XML 1.1 version setting in the XML declaration:
- U+0007 characters within only e1 and e3 prevent the XML from being well-formed.
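The XML 1.0 case can be checked with a short Python sketch. It assumes the standard-library xml.etree.ElementTree parser, which is built on expat and implements XML 1.0 only, so it cannot demonstrate the XML 1.1 behaviour. The helper is_well_formed and the cases dictionary are names invented for this example.

```python
import xml.etree.ElementTree as ET

BEL = "\x07"  # the literal U+0007 character

# One small document per element from the example above.
cases = {
    "e1": "<r><e1>" + BEL + "</e1></r>",                   # literal U+0007
    "e2": "<r><e2>&#x7;</e2></r>",                         # character reference
    "e3": "<r><e3><![CDATA[" + BEL + "]]></e3></r>",       # literal U+0007 in CDATA
    "e4": "<r><e4><![CDATA[&#x7;]]></e4></r>",             # reference text in CDATA
}


def is_well_formed(doc):
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


results = {name: is_well_formed(doc) for name, doc in cases.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "NOT well-formed")
```

Under this XML 1.0 parser only the e4 case should be accepted, since the text &#x7; inside CDATA is never interpreted as a character reference.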
¹ Note that the question source (viewable via the edit link on the question) does indeed contain literal U+0007 characters where noted even though the formatted XML does not.