XSLT parse text node as XML?


Question

In the middle of an XML document I'm transforming, there is a CDATA node which I know is itself composed of XML. I would like to have that "recursively parsed" as XML so that I can transform it too. After searching, I think my question is very similar to "Handling node containing inner escaped XML".

That question was asked a year ago, so may I clarify the following:

  1. It says this cannot be done by a single XSLT in one go: rather, you need a two-phase approach. I have just bought a shiny new book on XSLT 2.0. Is it still the case that there is no XSLT instruction to "re-parse" a string node as XML?
  2. In my case the XML-string node is just one node in the whole. Therefore in Phase #1 I would only be transforming a fragment of the input XML document; the rest needs passing through unchanged to Phase #2. I see several solutions for passing input to output unchanged, but often it seems they "mostly work" while skipping or mishandling some kinds of input node. Is there a reliable construct for passing the rest of the input to the output without any changes?
  3. That approach relies on me being able to apply 2 transforms separately. I am limited (by an existing application) to only one transform: the XML output is fixed, and it is transformed by one XSLT file. The only thing I can do is put whatever I like into that XSLT file, and/or add further XSLT files, but I cannot influence the top-level call that passes the XML through one XSLT file. Is there anything I could put into an XSLT file that would cause a second XSLT transform to be invoked?

Recommended answer

See the updates at the end.

  1. The most important question. It's possible to do; the question is whether you'd have to write an XML parser manually in XSLT, or use an extension function, or whether there's a convenient, portable solution. Update: if you can use Saxon's parse() extension function, that's by far your best bet. Do you have access to that?
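As a rough sketch of what the Saxon route might look like: the stylesheet below copies everything through unchanged and re-parses the string value of one element. The element name `fred` is only an illustrative assumption (borrowed from the comment quoted later in this answer), and whether `saxon:parse()` is available depends on the Saxon version and edition in use, so check that before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes a Saxon release that provides saxon:parse().
     The element name "fred" is hypothetical. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Identity template: copy every other node through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Re-parse the string value of <fred> as XML, then apply templates
       to the parsed nodes so they can be transformed like any others. -->
  <xsl:template match="fred">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="saxon:parse(string(.))/node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The embedded string has to be a well-formed XML document (a single root element) for the parse to succeed; a fragment with several top-level elements would need wrapping first.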

  2. Easy to answer: yes, use the identity transform. This will not preserve all lexical details of the input XML, such as the order of attributes, or whether <foo/> is written as <foo></foo>. However, it will preserve all details that are supposed to matter to XML processors.
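For reference, the usual XSLT 1.0 form of the identity transform is shown here as a minimal, complete stylesheet. Nothing in it is specific to the asker's document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Matches attributes and every kind of child node: elements, text,
       comments, and processing instructions. Each node is shallow-copied,
       then its attributes and children are processed recursively. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Any more specific template added alongside it takes precedence for the nodes it matches, which is what makes this a convenient base for changing one part of a document and leaving the rest alone.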

But this won't help you if you can't run 2 stylesheets in a pipeline, right?

  3. Hmm... not robustly. If your output is going to be displayed by a browser, or handled by something else that understands an XML stylesheet processing instruction, you could output one of those, and hope (against the spec's recommendation!) that serialization and parsing would occur in between this stylesheet and the one you associated on output. But this would be very fragile. I say "against the spec's recommendation" because the spec says

When this or any other mechanism yields a sequence of more than one XSLT stylesheet to be applied simultaneously to a XML document, then the effect should be the same as applying a single stylesheet that imports each member of the sequence in order

which would imply that there is no serialization and parsing in between. Not recommended.

Update: a new comment says that you don't know in advance which elements will contain CDATA sections. I jumped to the conclusion that this meant you didn't know which elements would contain unparsed data (since XML processors officially don't know or care which elements are in CDATA sections, per se). In that case, all bets are off. As you may know, XML processors are not supposed to know which parts of an XML input document are in CDATA sections. CDATA is just a different way of escaping markup, an alternative to &lt; etc. Once the data is parsed (which is not properly under the XSLT processor's jurisdiction), you can't tell how it was initially expressed in markup. A left pointy bracket remains a left pointy bracket whether it's expressed as <![CDATA[ < ]]> or &lt;. Just as in C, it doesn't matter whether you specify a character as 'A' or 65 or 0x41; once the program is compiled, your code won't be able to tell the difference.
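To make that concrete, here is a small made-up input document (the element names `notes` and `note` are illustrative only). After parsing, both `note` elements have exactly the same string value, so a stylesheet has no way to tell which one was written with CDATA.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<notes>
  <!-- The angle bracket is written inside a CDATA section. -->
  <note><![CDATA[year < 2001]]></note>
  <!-- The same character is written as an entity reference. -->
  <note>year &lt; 2001</note>
</notes>
```

In XPath terms, `note[1] = note[2]` is true for this document, and `contains(., '&lt;')` holds for both elements.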

Therefore, if you don't have another way of determining which data in your input document needs to be parsed, then none of the above methods will help you: you can't know where to apply saxon:parse(), nor manual parsing, nor disable-output-escaping with a following XSLT transformation.

Workarounds:

  • You could guess, e.g. with test="contains(., '&lt;')", which nodes contain unparsed data. (Note that this tests for the left pointy bracket regardless of whether it's expressed as a character entity, or as part of a CDATA section, or any other way.) You'd sometimes get false positives, e.g. if a text node contained the string "year < 2001". Or you could attempt to parse every text node (very inefficient), and for those that parse successfully as well-formed XML documents, output the tree instead of the text.

  • Or you could preprocess the XML with a non-XML tool (like LexEv), which therefore can "see" the CDATA markup. But you've said that you can't control anything outside the single XSLT.

  • Or, ideally, you could send the message back up the chain that the XML you're being given is unworkable: they need to flag somehow, other than by using CDATA markup, which sections contain unparsed data. Usually this would be done either by specifying certain element names, or by using attribute flags. Obviously this would depend on who's supplying the XML.
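The guessing workaround could be sketched as follows. This stylesheet only reports which text nodes look like they might contain markup; the output element names `candidates` and `candidate` are invented for the example, and a hit is no guarantee that the text is well-formed XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- List every text node whose value contains a left angle bracket,
       however that bracket was escaped in the source document. -->
  <xsl:template match="/">
    <candidates>
      <xsl:for-each select="//text()[contains(., '&lt;')]">
        <!-- name(..) is the name of the element holding the text node. -->
        <candidate parent="{name(..)}">
          <xsl:value-of select="."/>
        </candidate>
      </xsl:for-each>
    </candidates>
  </xsl:template>

</xsl:stylesheet>
```

A text node such as "year < 2001" would be listed too, which is the false-positive case described in the first workaround.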

Another update: OK, now I understand. You know which element contains unparsed data (and you know it's marked up with CDATA), but you don't know which other data might be marked up with CDATA.

the idea was to transform [i.e. parse -Lars] the known CDATA node ("fred") into XML nodes while leaving the whole of the rest of the document as original input, so that it could then be piped through the "general" transformation

For this purpose, "leaving the whole of the rest of the document as original input" does not need to mean preserving any CDATA markup. (The general transformation downstream will not know or care what data is CDATA-escaped.) All that is required is that the one unparsed node gets parsed, and the rest does not. The identity transform will do the latter just fine; you can ignore what that page says about CDATA sections on the output... the downstream XSLT will not know or care. (Unless you have additional, non-XML requirements for the output that you haven't told us about.)

So if you could do a two-stylesheet transform, with serialization and parsing in between (that is, not in a traditional SAX pipeline, for example), then the identity transform would work: you'd just need an additional template for the known unparsed node, with disable-output-escaping, as in Tomalak's answer to the linked question.
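A minimal sketch of that first-phase stylesheet, assuming the known unparsed element is named `fred` as in the quoted comment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: pass the rest of the document through. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For the known element, write its text without escaping, so the
       serialized output contains real markup instead of &lt; sequences. -->
  <xsl:template match="fred">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:value-of select="." disable-output-escaping="yes"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because disable-output-escaping only takes effect when the result tree is serialized, and processors are allowed to ignore it, this only helps when the output is actually written out and re-parsed before the second stylesheet runs.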

But if you can't do a two-step transform... what XSLT processor are you using? There may be other avenues specific to it.
