php - detect HTML in string and wrap with code tag
Problem description
I'm having trouble handling HTML inside text content. I'm looking for a method that detects the tags and wraps each run of consecutive tags inside code tags.
Don't wrap me <p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span>
Don't wrap me <h1>End</h1>
.
//expected result
Don't wrap me <code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>
Don't wrap me <code><h1>End</h1></code>
.
Is this possible?
It is hard to use DOMDocument in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head, html). One way is to construct a pattern as a lexer using the (?(DEFINE)...) feature and named subpatterns:
$html = <<<EOD
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<self> < [^\W_]++ [^>]* > )
(?<comment> <!-- (?>[^-]++|-(?!->))* -->)
(?<cdata> \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
(?<text> [^<]++ )
(?<tag>
< ([^\W_]++) [^>]* >
(?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
</ \g{-1} >
)
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;
$html = preg_replace($pattern, '<code>$0</code>', $html);
echo htmlspecialchars($html);
The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. This definition section and the named subpatterns inside it don't match anything by themselves; they are there to be used later in the main pattern.
(?<abcd> ...) defines a subpattern that you can reuse later with \g<abcd>. In the above pattern, the subpatterns defined in this way are self, comment, cdata, text and tag:
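As a smaller illustration of the same mechanism, here is a minimal sketch that defines one subpattern inside (?(DEFINE)...) and calls it several times from the main pattern. The octet subpattern and the isDottedQuad() helper are hypothetical names chosen for this example only; they are not part of the answer above.

```php
<?php
// Hypothetical example: the DEFINE section only declares "octet";
// nothing is matched until the main pattern calls it with \g<octet>.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<octet> 25[0-5] | 2[0-4][0-9] | 1[0-9]{2} | [1-9]?[0-9] )
)
# main pattern: four octets separated by dots
\A \g<octet> (?: \. \g<octet> ){3} \z
~x
EOD;

// Illustrative helper (an assumption of this sketch, not from the answer).
function isDottedQuad(string $s, string $pattern): bool
{
    return preg_match($pattern, $s) === 1;
}

var_dump(isDottedQuad('192.168.0.1', $pattern));
var_dump(isDottedQuad('256.1.1.1', $pattern));
```

The subpattern is written once and reused four times, which is the same reason the answer's pattern defines self, comment, cdata, text and tag up front.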
self:
[^\W_] is a trick to obtain \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern.
[^>]* means anything that is not a >, zero or more times.
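To check what [^\W_]++ and [^>]* pick up, the following sketch pulls the tag name out of a single opening tag. The tagName() helper is a hypothetical name used only here, and the pattern is anchored with \A and \z just for the demonstration.

```php
<?php
// Illustrative helper (hypothetical): returns the tag name of a single
// opening tag, or null when the name does not start with a letter or digit.
function tagName(string $tag): ?string
{
    // [^\W_]++ : word characters except the underscore (possessive)
    // [^>]*    : anything up to the closing angle bracket (attributes)
    if (preg_match('~\A<([^\W_]++)[^>]*>\z~', $tag, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(tagName('<div class="text">'));
var_dump(tagName('<h1>'));
var_dump(tagName('<_x>'));
```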
comment:
(?>[^-]++|-(?!->))* describes all the possible content inside an HTML comment:
(?>          # open an atomic group
    [^-]++   # all that is not a literal -, one or more times (possessive)
  |          # OR
    -        # a literal -
    (?!->)   # not followed by -> (negative lookahead)
)*           # close and repeat the group zero or more times
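The comment subpattern can be tried on its own. This is a minimal sketch that collects every HTML comment in a string; findComments() is a hypothetical helper name introduced for this example.

```php
<?php
// Illustrative helper (hypothetical): returns all HTML comments in $html.
function findComments(string $html): array
{
    // Same body as the "comment" subpattern: any run of non-hyphens,
    // or a single hyphen that does not start the closing "-->".
    preg_match_all('~<!--(?>[^-]++|-(?!->))*-->~', $html, $m);
    return $m[0];
}

print_r(findComments('a<!-- x - y --><b><!--z-->'));
```

A lone hyphen inside the comment is accepted because the lookahead only rejects a hyphen that is followed by "->".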
cdata:
All characters between \Q and \E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.)
The content allowed in CDATA is described in the same way as the content of HTML comments.
text:
[^<]++ matches all characters up to an opening angle bracket or the end of the string.
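The two subpatterns described above can also be exercised separately. In this sketch, findCdata() and leadingText() are hypothetical helper names, and the \A anchor in leadingText() is added only for the demonstration.

```php
<?php
// Illustrative helper (hypothetical): returns all CDATA sections in $s.
function findCdata(string $s): array
{
    // \Q...\E makes "<![CDATA[" literal, so the brackets need no escaping.
    preg_match_all('~\Q<![CDATA[\E(?>[^]]++|](?!]>))*]]>~', $s, $m);
    return $m[0];
}

// Illustrative helper (hypothetical): returns the text before the first "<".
function leadingText(string $s): string
{
    return preg_match('~\A[^<]++~', $s, $m) === 1 ? $m[0] : '';
}

print_r(findCdata('x<![CDATA[ a ] b ]]>y'));
var_dump(leadingText('plain text<b>bold</b>'));
```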
tag:
This is the most interesting subpattern. Lines 1 and 3 of the subpattern are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left").
Line 2 describes the possible content between an opening and a closing tag. You can see that this description uses not only the subpatterns defined before but also the current subpattern itself, to allow nested tags.
Once all items have been set and the definition section closed, you can easily write the main pattern.
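To look at the relative back-reference and the recursion in isolation, here is a reduced sketch that keeps only the text and tag subpatterns. It is a simplification for illustration, not a replacement for the full pattern, and matchBalanced() is a hypothetical helper name.

```php
<?php
// Reduced version of the answer's pattern: only "text" and "tag" are kept.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<text> [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >           # line 1: opening tag, name captured
        (?> \g<text> | \g<tag> )*      # line 2: text or nested tags
        </ \g{-1} >                    # line 3: closing tag with the same name
    )
)
# main pattern: a single balanced element
\g<tag>
~x
EOD;

// Illustrative helper (hypothetical): returns the first balanced element
// found in $s, or null when none matches.
function matchBalanced(string $s, string $pattern): ?string
{
    return preg_match($pattern, $s, $m) === 1 ? $m[0] : null;
}

var_dump(matchBalanced('x <div><p>a</p><p>b</p></div> y', $pattern));
var_dump(matchBalanced('<div><p>a</div>', $pattern));
```

In the second call the p element is never closed, so neither the div nor the p can be matched as a balanced element and the helper returns null.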