Extract text in HTML comment using XPath and regex

Question

I'm trying to parse HTML files with an XML/HTML parser. The files contain hidden, commented-out text that needs to be translated, namely X and Y below.

<!-- Title: " X " Tags: " Y " -->

Which XPath expression would best match X and Y? The //comment() expression matches the whole comment node, but I need to match the two occurrences of text between the " and " quotes.

I guess this needs a combination of XPath and regular expressions, but I'm not sure how to approach it.

Recommended answer

I assume that the quotes in the comment are the same regular quote character ", not the typographically different opening and closing quotes that appear when this question is displayed.

If this assumption is wrong, simply replace the standard quote in the expressions below with the respective quote character.

Use (if the comment in question is the first one in the document):

substring-before(substring-after(//comment(), '"'), '"')

This produces the string (without the surrounding quotes):

X

For the second quoted string, use:

substring-before(
   substring-after(
        substring-after(
               substring-after(//comment(), '"'), 
               '"'), 
        '"'), 
   '"')
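To make the nesting easier to follow, here is a minimal Python sketch that mimics the XPath 1.0 functions substring-before() and substring-after() on the comment text. The helper names `substring_before` and `substring_after` and the hard-coded `comment` string are assumptions made for illustration; they are not part of any XPath library. Each call to `substring_after` discards everything up to and including the next quote, so three calls skip past the first quoted string and the opening quote of the second.

```python
def substring_after(s: str, sep: str) -> str:
    """Like XPath substring-after(): text after the first sep, or "" if sep is absent.

    Assumes sep is non-empty (str.partition rejects an empty separator).
    """
    _, found, after = s.partition(sep)
    return after if found else ""


def substring_before(s: str, sep: str) -> str:
    """Like XPath substring-before(): text before the first sep, or "" if sep is absent."""
    before, found, _ = s.partition(sep)
    return before if found else ""


# String value of the comment node from the question (assumed input).
comment = ' Title: " X " Tags: " Y " '

# First quoted string: drop everything up to the first quote, keep up to the next one.
first = substring_before(substring_after(comment, '"'), '"')

# Second quoted string: skip three quotes, then keep up to the fourth.
rest = comment
for _ in range(3):
    rest = substring_after(rest, '"')
second = substring_before(rest, '"')

print(repr(first), repr(second))
```

The spaces inside the quotes are kept, because the functions only split on the quote character itself.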

XSLT-based verification (because an XSLT stylesheet must be a well-formed XML document, the quotes in the expressions are replaced with the entity &quot; to avoid errors caused by nested quotes):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     "<xsl:copy-of select="substring-before(substring-after(//comment(), '&quot;'), '&quot;')"/>"
=============
   "<xsl:copy-of select=
   "substring-before(substring-after(substring-after(substring-after(//comment(), '&quot;'), '&quot;'), '&quot;'), '&quot;')"/>"
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
  </body>
</html>

the two XPath expressions are evaluated and their results are copied to the output (surrounded by quotes to show the exact strings copied):

     " X "
=============
   " Y "

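The question also asks about combining a parser with regular expressions. XPath 1.0 has no regex functions, so one option is to do the regex step in the host language after locating the comment node. The following is only a sketch under stated assumptions: it uses Python's standard xml.dom.minidom to parse the sample document above (which works only for well-formed markup) and a hypothetical helper `iter_comments` to collect comment nodes, then applies a regex to pull out every quoted string.

```python
import re
from xml.dom import Node
from xml.dom.minidom import parseString

# Sample document from the answer; assumed to be well-formed XML.
DOC = """<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
  </body>
</html>"""


def iter_comments(node):
    """Yield the text of every comment node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.COMMENT_NODE:
            yield child.data
        else:
            yield from iter_comments(child)


# Matches a double-quoted run and captures the text between the quotes.
QUOTED = re.compile(r'"([^"]*)"')

first_comment = next(iter_comments(parseString(DOC)))
quoted = QUOTED.findall(first_comment)

print(quoted)
```

Unlike the nested substring calls, the regex returns all quoted strings at once, so it does not need to be rewritten when a comment carries more than two of them.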