Final solution for using regex to remove html nested tags of the same type?
Problem description
I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something similar, but let's take this as a theoretical question). After looking up answers, I finally came up with what I'll show near the end of this question.
What follows is just a summary of a lot of things I've tried before.
First of all, what I mean by nested tags of the same type is:
Text outside any div
<div id="my_id"> bla bla
<div>
bla bla bla
<div style="some style here">
lalalalala
</div>
</div>
I'm trapped in a div!
</div>
more text outside divs
<div>more divs here!
<div id="justbeingannoying">radiohead rules</div>
</div>
Now imagine I want to remove all the divs and their content using regex. So the intended result would be:
Text outside any div
more text outside divs
The first idea would be matching everything. The following regex matches div tags with attributes (style, id, etc.):
/<div[^>]*>.*<\/div>/sig
The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.
This could be solved using the U modifier (Ungreedy)
/<div[^>]*>.*<\/div>/sigU
but then we'll have the problem of matching less than we want: it will match only from the first "<div" till the first "</div>" (so, if we remove the matches, besides some unmatched tags, we'll be left with the text "I'm trapped in a div!", which we don't want).
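To make the greedy/lazy difference concrete, here is a minimal Python sketch (an illustration, not the code from the question). It assumes Python's built-in re module, which has no U modifier, so the lazy variant is written with .*? instead. The shortened sample string is made up for the example.

```python
import re

# Shortened, hypothetical version of the sample markup from the question.
html = (
    "Text outside any div\n"
    '<div id="my_id"> bla bla\n'
    "<div>\nbla bla bla\n</div>\n"
    "I'm trapped in a div!\n</div>\n"
    "more text outside divs\n"
    "<div>more divs here!</div>\n"
)

# Greedy: .* runs from the first "<div" to the LAST "</div>" in the input.
greedy = re.compile(r"<div[^>]*>.*</div>", re.S | re.I)
# Lazy: .*? stops at the FIRST "</div>" (the effect the PCRE U modifier has on .*).
lazy = re.compile(r"<div[^>]*>.*?</div>", re.S | re.I)

greedy_result = greedy.sub("", html)
lazy_result = lazy.sub("", html)

print(repr(greedy_result))
print(repr(lazy_result))
```

The greedy version also deletes "more text outside divs", while the lazy one leaves "I'm trapped in a div!" and a stray closing tag behind.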
So, I found a solution that works like a charm for nested parentheses, square brackets, etc:
/\[([^\[\]]*+|(?R))*\]/si
Basically, this finds an opening square bracket, then matches anything that is neither an opening nor a closing square bracket OR a recursive structure of that, and finally finds a closing square bracket.
What I have working now is a bad solution: basically, first I replace all the opening tags with an opening square bracket (which can't be in my code, for other reasons), then the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.
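The recursive pattern needs an engine that supports (?R), such as PCRE. For engines without recursion (Python's built-in re is one), the same bracket workaround can be sketched by repeatedly deleting the innermost [...] pair until nothing changes. Note that this is a different technique from the recursive regex above, and strip_brackets is a name invented for this illustration.

```python
import re

# An innermost pair: "[" followed by non-bracket characters and then "]".
INNERMOST = re.compile(r"\[[^\[\]]*\]")


def strip_brackets(text: str) -> str:
    """Remove every balanced [...] group, working from the inside out."""
    while True:
        text, replaced = INNERMOST.subn("", text)
        if replaced == 0:  # nothing left to remove
            return text


print(strip_brackets("a[b[c]d]e[f]g"))
```

Unbalanced brackets are simply left in place, since they never form an innermost pair.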
The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the HTML tags has to work. But it is not that easy. The problem is that the negation for characters ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved by this:
.+?(?=<div>)
and, of course, the same for the closing tag:
.+?(?=<\/div>)
This is how, more or less, I arrived at this regex:
/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis
which works exactly like the first regex I presented before: https://regex101.com/r/yU8pV3/1
So, here is my question: what is wrong with that regex?
Thank you!
Answer
DISCLAIMER
Since the question has met with a positive reaction, I am posting an answer that explains what is wrong with your approach and shows how to match text that is not some specific text.
HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.
What is wrong with your regex
Your regex contains the
<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*
part (the same as <div((.+?(?=<\/?div>))|(?R))*) before the closing <\/div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unrolled-loop structure, and you know what you are doing). Here is what the pattern does:
<div - matches <div literally (also in <diverse, because there is no word boundary or \s after it)
( - start of Group 1, which matches:
    (.+?(?=<\/div>)|.+?(?=<div>)) - any 1+ chars (as few as possible) up to the first </div> or up to the first <div>
    | - or
    (?R) - recursion (i.e. the whole pattern is inserted and used here)
)* - Group 1 is repeated zero or more times.
The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div> itself, whereas this branch MUST only match text that is NOT EQUAL to the leading and trailing delimiters.
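A small Python sketch can show the failure in isolation. It assumes the built-in re module and tests only the lazy-dot branch, not the whole recursive pattern. The input string is made up for the example.

```python
import re

# The problematic branch on its own: lazy dot up to the next <div> or </div>.
BRANCH = re.compile(r".+?(?=</?div>)", re.S)

# .+? must consume at least one character, and nothing stops that character
# from being the "<" of an opening tag, so the delimiter itself is swallowed.
match = BRANCH.match("<div>inner</div>")
swallowed = match.group(0)

print(repr(swallowed))
```

The branch happily consumes the opening <div> together with the text, which is why the nesting can no longer be tracked by the recursion.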
Solution(s)
To match text other than some specific text, use a tempered greedy token:
<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^
See the regex demo. Note that you must use the DOTALL modifier to be able to match text across newlines. The capturing group is redundant; you can remove it.
What is important here is that (?:(?!<\/?div\b).)+ only matches 1 or more characters that are not the starting character of a <div....> or </div sequence. See my above linked thread on how that works.
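The behaviour of the token can be checked on its own with a short Python sketch. Python's built-in re has no (?R), so only the tempered greedy token is exercised here, not the full recursive pattern. The helper name token_run and the inputs are invented for the illustration.

```python
import re

# The tempered greedy token: each "." is consumed only if a div tag
# (opening or closing) does NOT start at that position.
TOKEN = re.compile(r"(?:(?!</?div\b).)+", re.S)


def token_run(text: str):
    """Return what the token matches at the start of text, or None."""
    m = TOKEN.match(text)
    return m.group(0) if m else None


print(repr(token_run("plain <b>text</b>\nhere</div>")))  # stops before </div>
print(repr(token_run("<div>x")))                         # cannot start on <div
print(repr(token_run("<diverse>x</div>")))               # \b lets <diverse through
```

Other tags such as <b> are consumed as ordinary text, while the token refuses to step over either div delimiter.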
As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:
<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*
See this regex demo
Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed with 0+ sequences of a < that is not followed with /div or div (as a whole word) and then again 0+ non-< characters.
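As a rough check, the two tokens can be compared side by side in Python (built-in re, no recursion, so again only the tokens are tested). The helper leading_run and the inputs are made up for the example. One difference worth noticing: the unrolled form starts with [^<]+, so it needs at least one non-< character first and will not start on a tag such as <b>.

```python
import re

TEMPERED = re.compile(r"(?:(?!</?div\b).)+", re.S)
UNROLLED = re.compile(r"[^<]+(?:<(?!/?div\b)[^<]*)*")


def leading_run(pattern, text):
    """Return what pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


sample = "some <b>bold</b>\ntext</div> tail"
print(leading_run(TEMPERED, sample) == leading_run(UNROLLED, sample))

# Text that begins with a non-div tag: the tempered token matches it,
# the unrolled one does not, because [^<]+ cannot match "<".
tag_first = "<b>x</b></div>"
print(leading_run(TEMPERED, tag_first), leading_run(UNROLLED, tag_first))
```

On text that starts with plain characters both tokens stop at the same place, just before the closing </div>.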
<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.
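For engines without recursion (Python's built-in re has no (?R)), the nesting can be tracked with an explicit depth counter instead. This is a different technique from the recursive pattern, shown only as a hedged sketch. It uses the stricter "<div followed by whitespace or >" idea for the opening tag, the name strip_divs is invented here, and it assumes attribute values never contain ">".

```python
import re

# A div tag token: "<div" or "</div" that is followed by whitespace or ">".
DIV_TAG = re.compile(r"<(/?)div(?=[\s>])[^>]*>", re.I)


def strip_divs(html: str) -> str:
    """Drop every top-level <div>...</div> block, including nested divs."""
    kept = []   # text found outside any div
    depth = 0   # current div nesting level
    pos = 0     # start of the current outside-div text run
    for m in DIV_TAG.finditer(html):
        if m.group(1) != "/":        # opening tag
            if depth == 0:
                kept.append(html[pos:m.start()])
            depth += 1
        elif depth > 0:              # closing tag that matches an open div
            depth -= 1
            if depth == 0:
                pos = m.end()
        # a stray closing tag at depth 0 is left in the output untouched
    if depth == 0:
        kept.append(html[pos:])      # an unclosed div drops the remainder
    return "".join(kept)


sample = (
    "Text outside any div\n"
    '<div id="my_id"> bla bla\n<div>\nbla bla bla\n'
    '<div style="some style here">\nlalalalala\n</div>\n</div>\n'
    "I'm trapped in a div!\n</div>\n"
    "more text outside divs\n"
    '<div>more divs here!\n<div id="justbeingannoying">radiohead rules</div>\n</div>\n'
)
print(strip_divs(sample))
```

With this opening-tag check, something like <div-tmp> is not treated as a div and stays in the output.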