How do I unescape HTML, then transform it with XSLT?


Problem description

I'm fairly new to XSLT, and I have a large XML document that I'm trying to transform into ICML (an XML variant used by Adobe InDesign). The relevant portion of the source document I'm working with looks something like this:

<BiographicalNote>
 &lt;p&gt;This text includes escaped HTML entities.&lt;/p&gt;
</BiographicalNote>

The XML itself is fine but the HTML it contains is escaped.

And here is a rough example of what I need the end product to look like:

<ParagraphStyleRange>
 <CharacterStyleRange>
  <Content>
   This text includes escaped HTML entities.
  </Content>
 </CharacterStyleRange>
</ParagraphStyleRange>

I can transform <BiographicalNote> to <ParagraphStyleRange><CharacterStyleRange><Content> no problem, but the escaped entities are stumping me. I can't seem to strip out the <p> tags.

Some important considerations:

  • The HTML portions of the source document were written by a variety of people with different levels of familiarity with HTML, and they are not always well formed. Unescaping the whole source document is not an option since it causes parser errors when running the XSLT.
  • The source document is very large (more than 120,000 lines) so it would be incredibly impractical and time-consuming to find and fix the malformed HTML. It is much more feasible, though, to fix any bad HTML within the specific parts of the file I actually need (less than 1%).
  • While I want to strip out the <p> tags, I need to preserve most other tags (<i>, <em>, <b>, etc.) so that I can transform them into <CharacterStyleRange> tags later.
  • I'm currently writing my XSLT locally and running the transform using xsltproc on the Terminal (Mac). Eventually, though, I will migrate to a PHP system and run the transformations on the server side.

My basic template looks like this:

<xsl:template match="BiographicalNote">
 <ParagraphStyleRange>
  <CharacterStyleRange>
   <Content>
   ...
   </Content>
  </CharacterStyleRange>
 </ParagraphStyleRange>
</xsl:template>

So it's what goes inside the <Content> tags I need to figure out. Here's what I've tried:

<xsl:call-template name="DescriptionParser">
 <xsl:with-param name="DescriptionText"><xsl:value-of select="." disable-output-escaping="yes" /></xsl:with-param>
</xsl:call-template>

<xsl:template name="DescriptionParser">
 <xsl:param name="DescriptionText" />
 <xsl:copy-of select="exsl:node-set($DescriptionText)/p" />
</xsl:template>

And:

<xsl:variable name="TaglineText"><xsl:value-of select="." disable-output-escaping="yes" /></xsl:variable>
<xsl:copy-of select="exsl:node-set($TaglineText)/p" />

Both of these yield an empty <Content> tag. Suspiciously, though, if select="exsl:node-set($TaglineText)", it works as expected and returns <p>This text includes escaped HTML entities.</p> with everything unescaped.

Also, using xsl:value-of instead of xsl:copy-of makes no difference when select="exsl:node-set($TaglineText)/p" (returns nothing); but when select="exsl:node-set($TaglineText)" it returns the original escaped HTML.

For some reason, it doesn't seem to recognize the <p> tag as a node, and therefore can't find it. Maybe disable-output-escaping isn't playing nice with exsl:node-set?

Can anyone tell me how to get the XSLT to recognize the <p> tags as nodes, or at the very least why this isn't working? I got most of the pieces to this puzzle from other StackOverflow topics, but I'm stumped on this bit.

Solution

I am not sure what your question is. Escaped text is not XML and cannot be processed as XML. There are no nodes you can select, so the best you can hope for is a result of:

<Content>
<p>This text includes escaped HTML entities.</p>
</Content>

which is easy to get using:

<Content>
    <xsl:value-of select="." disable-output-escaping="yes"/>
</Content>
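
To see why the /p step in the question selects nothing, a small diagnostic stylesheet can help. This is only a sketch: the Diagnostics, TextNodes and PElements element names are made up for illustration, and it assumes a processor that supports the EXSLT common extension (xsltproc does). The variable ends up holding a single text node, because disable-output-escaping only influences how text is serialized and does not cause it to be parsed as markup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="BiographicalNote">
    <!-- The variable stores a result tree fragment with one text node.
         disable-output-escaping is a hint to the serializer only, so the
         stored text is never parsed into element nodes. -->
    <xsl:variable name="TaglineText">
      <xsl:value-of select="." disable-output-escaping="yes"/>
    </xsl:variable>

    <!-- Hypothetical wrapper elements, used only to report the counts -->
    <Diagnostics>
      <TextNodes>
        <xsl:value-of select="count(exsl:node-set($TaglineText)/text())"/>
      </TextNodes>
      <PElements>
        <xsl:value-of select="count(exsl:node-set($TaglineText)/p)"/>
      </PElements>
    </Diagnostics>
  </xsl:template>

</xsl:stylesheet>
```

Run against the sample BiographicalNote from the question, this should report one text node and zero p elements, which matches the empty <Content> the question describes.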

If you want to remove the wrapping element, you must do so using string functions. If you can be sure that the wrapping element is <p> (or any other tag whose name is one character long), you can do:

<Content>
    <xsl:variable name="text" select="normalize-space(.)" />
    <xsl:value-of select="substring($text, 4, string-length($text) - 7)" disable-output-escaping="yes"/>
</Content>
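
Put together with the template from the question, the string approach might look like the sketch below. The guard is an addition that is not part of the original answer: it strips the wrapper only when the normalized text really starts with <p> and ends with </p>, and otherwise writes the text unchanged. It assumes one wrapping <p> per BiographicalNote; several paragraphs in one note would leave the inner </p><p> pairs in place.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="BiographicalNote">
    <xsl:variable name="text" select="normalize-space(.)"/>
    <xsl:variable name="len" select="string-length($text)"/>
    <ParagraphStyleRange>
      <CharacterStyleRange>
        <Content>
          <xsl:choose>
            <!-- XSLT 1.0 has no ends-with(), so compare the last four
                 characters with the closing tag instead -->
            <xsl:when test="starts-with($text, '&lt;p&gt;')
                            and substring($text, $len - 3) = '&lt;/p&gt;'">
              <!-- Skip 3 leading characters and drop 4 trailing ones -->
              <xsl:value-of select="substring($text, 4, $len - 7)"
                            disable-output-escaping="yes"/>
            </xsl:when>
            <xsl:otherwise>
              <!-- No recognizable wrapper: emit the text as it is -->
              <xsl:value-of select="$text" disable-output-escaping="yes"/>
            </xsl:otherwise>
          </xsl:choose>
        </Content>
      </CharacterStyleRange>
    </ParagraphStyleRange>
  </xsl:template>

</xsl:stylesheet>
```

Tags such as <i> or <b> inside the paragraph are written to the output as markup, but they are still not nodes during this transformation, so they cannot be matched by templates in the same pass.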

Alternatively, save the result of this transformation to a file, and process the resulting file. However, this requires that the resulting file be a well-formed XML document - I understand you cannot be sure of that.
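
A second pass over such an intermediate file could then treat the unescaped HTML as real nodes. The sketch below rests on several assumptions: the intermediate file is well formed, a first pass has written each BiographicalNote with its content unescaped (so the <p> elements are direct children), and the FontStyle attribute values are placeholders for whatever character styles the ICML target actually needs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Each unescaped <p> becomes one paragraph range -->
  <xsl:template match="BiographicalNote/p">
    <ParagraphStyleRange>
      <xsl:apply-templates/>
    </ParagraphStyleRange>
  </xsl:template>

  <!-- Plain text directly inside <p> gets an unstyled character range -->
  <xsl:template match="BiographicalNote/p/text()">
    <CharacterStyleRange>
      <Content><xsl:value-of select="."/></Content>
    </CharacterStyleRange>
  </xsl:template>

  <!-- Inline emphasis; the FontStyle values are illustrative assumptions -->
  <xsl:template match="BiographicalNote/p/i | BiographicalNote/p/em">
    <CharacterStyleRange FontStyle="Italic">
      <Content><xsl:value-of select="."/></Content>
    </CharacterStyleRange>
  </xsl:template>

  <xsl:template match="BiographicalNote/p/b">
    <CharacterStyleRange FontStyle="Bold">
      <Content><xsl:value-of select="."/></Content>
    </CharacterStyleRange>
  </xsl:template>

</xsl:stylesheet>
```

This version flattens nested inline tags (a <b> inside an <i> keeps only the outer style) and does not handle other inline elements, so it is a starting point rather than a complete mapping.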
