php - detect HTML in string and wrap with code tag


Problem description


I'm having trouble handling HTML inside text content. I'm looking for a method that detects those tags and wraps each run of consecutive tags inside code tags.

Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>.

//expected result

Don't wrap me<code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>Don't wrap me <code><h1>End</h1></code>.

Is this possible?

Solution

DOMDocument is hard to use in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head, and html elements). One way is to construct a pattern as a lexer, using the (?(DEFINE)...) feature and named subpatterns:

$html = <<<EOD
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<self>    < [^\W_]++ [^>]* > )
    (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
    (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
    (?<text>    [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
        </ \g{-1} >
    )
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;

$html = preg_replace($pattern, '<code>$0</code>', $html);

echo htmlspecialchars($html);

The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. This definition section and the named subpatterns inside it don't match anything by themselves; they are there to be used later in the main pattern.

(?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the above pattern, subpatterns defined in this way are:

  • self: describes a self-closing tag
  • comment: for HTML comments
  • cdata: for CDATA sections
  • text: for text (everything that is not a tag, a comment, or CDATA)
  • tag: for HTML tags that are not self-closed
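
To see the definition mechanism in isolation, here is a minimal sketch on an unrelated task. The subpattern names octet and ipv4 and the sample string are assumptions made up for illustration; they do not appear in the pattern above. Only the line after the definition section consumes input.

```php
<?php
// Illustrative only: "octet" and "ipv4" are hypothetical subpattern names.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<octet> 25[0-5] | 2[0-4][0-9] | 1[0-9]{2} | [1-9]?[0-9] )
    (?<ipv4>  \g<octet> (?: \. \g<octet> ){3} )
)
# main pattern: the only part that actually consumes input
\b \g<ipv4> \b
~x
EOD;

// The definitions are never matched on their own; they are only called by name.
preg_match_all($pattern, 'hosts: 10.0.0.1, 192.168.1.254, 999.1.1.1', $m);
print_r($m[0]); // the two valid addresses; 999.1.1.1 is rejected
```

The x modifier is what makes the spaces and the # comment inside the pattern harmless.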

self:
[^\W_] is a trick to obtain \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern.
[^>]* means any character that is not a >, zero or more times.
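
A quick way to check the character-class trick, as a small sketch with made-up sample strings:

```php
<?php
// \w accepts the underscore, [^\W_] does not.
$wordClass    = preg_match('~^\w++$~', 'my_tag');     // 1: underscore is allowed by \w
$noUnderscore = preg_match('~^[^\W_]++$~', 'my_tag'); // 0: underscore is rejected
$tagName      = preg_match('~^[^\W_]++$~', 'h1');     // 1: letters and digits only

var_dump($wordClass, $noUnderscore, $tagName);
```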

comment:
(?>[^-]++|-(?!->))* describes all the possible content inside an html comment:

(?>          # open an atomic group
    [^-]++   # all that is not a literal -, one or more times (possessive)
  |          # OR
    -        # a literal -
    (?!->)   # not followed by -> (negative lookahead)
)*           # close and repeat the group zero or more times 
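
The same subpattern can be tried on its own. This is a sketch with an invented input string, written without the x modifier, so the pattern contains no spaces:

```php
<?php
// Stand-alone version of the comment subpattern.
$comment = '~<!--(?>[^-]++|-(?!->))*-->~';
$input   = 'a<!-- one - dash -->b<!-- two -->c';

preg_match_all($comment, $input, $m);
print_r($m[0]); // each comment is matched separately, single dashes included
```

Because a dash is only consumed when it is not followed by ->, the group cannot run past the first closing -->.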

cdata:
All characters between \Q and \E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.)
The content allowed in CDATA is described in the same way as the content of HTML comments.
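
As a small sketch (the input string is invented), the quoted form and the manually escaped form match the same prefix, and the whole cdata subpattern stops at the first ]]>:

```php
<?php
$input = 'x<![CDATA[ a[0] < b ]]>y';

// Two equivalent ways to write the literal prefix.
preg_match('~\Q<![CDATA[\E~', $input, $a);  // quoted with \Q...\E
preg_match('~<!\[CDATA\[~', $input, $b);    // escaped by hand

// The complete subpattern, written without the x modifier.
preg_match('~\Q<![CDATA[\E(?>[^]]++|](?!]>))*]]>~', $input, $c);

var_dump($a[0], $b[0], $c[0]);
```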

text:
[^<]++ matches all characters up to an opening angle bracket or the end of the string.

tag:
This is the most interesting subpattern. Lines 1 and 3 of the subpattern are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left").
Line 2 describes the possible content between an opening and a closing tag. You can see that this description uses not only the subpatterns defined before but also the current subpattern itself, to allow nested tags.
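
A reduced sketch of the recursion and the relative back-reference, keeping only the text and tag subpatterns. The anchors ^ and $ and the two sample strings are assumptions added for the demonstration:

```php
<?php
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<text> [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> )*
        </ \g{-1} >
    )
)
^ \g<tag> $
~x
EOD;

// Properly nested: the inner <p> is handled by the recursive call to "tag".
$nested = preg_match($pattern, '<div><p>nested</p> text</div>');

// Crossed tags: \g{-1} requires the closing name to equal the opening name.
$crossed = preg_match($pattern, '<div><p>mismatched</div></p>');

var_dump($nested, $crossed);
```

The first string matches and the second does not, because each level of recursion checks its own closing tag name.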

Once all items have been set and the definition section closed, you can easily write the main pattern.
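
If the goal is to display the wrapped markup as visible code while leaving the surrounding text untouched, one possible variation is to escape only the matched runs with preg_replace_callback. This is a sketch, not part of the original answer: the function name wrapHtmlRuns is hypothetical, and the choice of ENT_QUOTES with UTF-8 is an assumption.

```php
<?php
// Hypothetical helper: wraps each run of consecutive tags in <code> and
// escapes only the wrapped markup, leaving the surrounding text as-is.
function wrapHtmlRuns(string $html): string
{
    // Same pattern as in the answer, unchanged.
    $pattern = <<<'EOD'
~
(?(DEFINE)
    (?<self>    < [^\W_]++ [^>]* > )
    (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
    (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
    (?<text>    [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
        </ \g{-1} >
    )
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            // $m[0] is one whole run of consecutive tags.
            return '<code>' . htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8') . '</code>';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo wrapHtmlRuns('a<p>Hello</p><div class="text">x</div> b <h1>End</h1>');
```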
