Help With PHP and XPath
Question
I need help doing a few things with XPath in PHP.
With any given HTML, I need to:
- Remove all tables and their contents
- Remove everything after the first h1 tag
- Keep only paragraphs (including their inner HTML: links, lists, etc.)
With regex, I got everything working perfectly. When I ran into nested tables, however, I decided that it is indeed foolish to parse HTML with regex.
Thanks a lot!
Recommended answer
With any given HTML, I need to:
• Remove all tables and their contents
• Remove everything after the first h1 tag
• Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))
This can be done easily with XSLT:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Copy every node except when overridden by another template -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Remove all tables and their contents.
       priority="1" lets this rule win over the node()[...] rule
       at the bottom, whose default priority is 0.5; otherwise
       paragraphs nested inside a table would be kept. -->
  <xsl:template match="h:table" priority="1"/>

  <!-- Remove everything after the first h1.
       priority="1" avoids a tie with the rule below. -->
  <xsl:template match="node()[preceding::h:h1]" priority="1"/>

  <!-- Keep only paragraphs (INCLUDING their inner HTML
       (links, lists, etc)): any other node is skipped,
       but its children are still processed. -->
  <xsl:template match="node()[not(self::h:p) and not(ancestor::h:p)]">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
If your element names are not in the XHTML namespace, simply delete every occurrence of h: in the above code.
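Since the question is about PHP, here is a minimal sketch of how the stylesheet might be applied there. It assumes the dom and xsl extensions are enabled. The function name cleanHtml() and the sample input are made up for illustration and are not part of the original answer. DOMDocument::loadHTML() produces elements with no namespace, so the embedded copy of the stylesheet drops the h: prefix; it also sets explicit priority values on the two removal templates so that they take precedence over the paragraph filter.

```php
<?php
// Sketch only: cleanHtml() is a hypothetical helper name.
// Requires PHP's dom and xsl extensions.

function cleanHtml(string $html): string
{
    // Same rules as the stylesheet above, without the h: prefix,
    // because loadHTML() yields elements in no namespace.
    $xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="table" priority="1"/>
  <xsl:template match="node()[preceding::h1]" priority="1"/>
  <xsl:template match="node()[not(self::p) and not(ancestor::p)]">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
XSL;

    // Parse the (possibly messy) HTML; @ hides parser warnings for brevity.
    $doc = new DOMDocument();
    @$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);

    // Load the stylesheet as XML and hand it to the XSLT processor.
    $style = new DOMDocument();
    $style->loadXML($xsl);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($style);

    // transformToXml() returns a string, or false/null on failure.
    $result = $proc->transformToXml($doc);
    return is_string($result) ? trim($result) : '';
}

// Hypothetical sample input covering the three requirements.
$html = '<html><body>'
      . '<table><tr><td><p>inside table</p></td></tr></table>'
      . '<p>Keep <a href="x.html">this link</a></p>'
      . '<div>stray text</div>'
      . '<h1>Title</h1>'
      . '<p>after heading</p>'
      . '</body></html>';

$clean = cleanHtml($html);
echo $clean, "\n";
```

With this input, only the paragraph containing the link should survive: the table, the loose div text, the heading, and everything after the heading are dropped. Note that loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so real input may need an encoding hint.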