I've spent days trying to find a solution with regex (before somebody says it: I know I should be using PHP's DOMDocument library or something similar, but let's treat this as a theoretical question), looking up answers, and I finally came up with what I'll show near the end of this question.
What follows is just a summary of a lot of things I've tried before.
First of all, what I mean by nested tags of the same type is:
Text outside any div
<div id="my_id"> bla bla
<div>
bla bla bla
<div style="some style here">
lalalalala
</div>
</div>
I'm trapped in a div!
</div>
more text outside divs
<div>more divs here!
<div id="justbeingannoying">radiohead rules</div>
</div>
Now imagine I want to remove all the divs and their content using regex. So the intended result would be:
Text outside any div
more text outside divs
The first idea would be matching everything. The following regex matches div tags with properties (style, id, etc):
/<div[^>]*>.*<\/div>/sig
The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.
This could be solved using the U (ungreedy) modifier:
/<div[^>]*>.*<\/div>/sigU
but then we have the opposite problem of matching less than we want: it will only match from the first "<div" to the first "</div>" (so, if we remove the matches, we are left with some unmatched tags and with the text "I'm trapped in a div!", which we don't want).
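The two failure modes described above can be reproduced with a short sketch. It uses Python's standard `re` module rather than PHP, so two things are assumptions of the sketch: the `U` modifier is written out as the lazy quantifier `.*?`, and `re.sub` stands in for a global replace.

```python
import re

html = (
    'Text outside any div\n'
    '<div id="my_id"> bla bla\n'
    '<div>\n'
    'bla bla bla\n'
    '<div style="some style here">\n'
    'lalalalala\n'
    '</div>\n'
    '</div>\n'
    "I'm trapped in a div!\n"
    '</div>\n'
    'more text outside divs\n'
    '<div>more divs here!\n'
    '<div id="justbeingannoying">radiohead rules</div>\n'
    '</div>\n'
)

FLAGS = re.DOTALL | re.IGNORECASE

# Greedy: runs from the first "<div" to the LAST "</div>" in the input.
greedy = re.sub(r'<div[^>]*>.*</div>', '', html, flags=FLAGS)

# Lazy (what the U modifier does): stops at the FIRST "</div>" it meets.
lazy = re.sub(r'<div[^>]*>.*?</div>', '', html, flags=FLAGS)

print(repr(greedy))  # the text between the two div blocks is gone too
print(repr(lazy))    # stray closing tags and inner text are left behind
```

Neither variant counts nesting depth, which is why the rest of the question turns to recursion.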
So, I found a solution that works like a charm for nested parentheses, square brackets, etc:
/\[([^\[\]]*+|(?R))*\]/si
Basically, this finds an opening square bracket, then matches, any number of times, either a run of characters that are neither an opening nor a closing square bracket OR a recursive instance of the whole pattern, and finally finds a closing square bracket.
What I have working now is a bad solution: first I replace all the opening tags with an opening square bracket (which can't appear in my code, for other reasons), then the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.
The thing is I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the html tags has to work.
But it is not that easy. The problem is that character negation ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved with this:
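To make clear what the recursive bracket pattern accepts, here is a minimal sketch of the same idea in Python. Python's standard `re` module has no `(?R)`, so this does not run the regex; it swaps in an explicit depth counter, and the function name `remove_bracketed` is made up for the illustration. It also assumes the brackets in the input are balanced.

```python
def remove_bracketed(text: str) -> str:
    """Drop every balanced [...] group, including nested ones.

    Hypothetical helper: a depth counter standing in for the recursion.
    Assumes balanced input; an unclosed "[" swallows the rest of the text,
    which the regex would not do.
    """
    kept = []
    depth = 0
    for ch in text:
        if ch == '[':
            depth += 1          # entering a (possibly nested) group
        elif ch == ']' and depth > 0:
            depth -= 1          # leaving the current group
        elif depth == 0:
            kept.append(ch)     # only text outside all groups survives
    return ''.join(kept)


print(remove_bracketed('a[b[c]d]e[f]g'))
```

The counter plays the role that `(?R)` plays in the pattern: each opening bracket has to be paired with its own closing bracket before the outer group can end.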
.+?(?=<div>)
and, of course, the same for the closing tag
.+?(?=</div>)
This is how, more or less, I arrived at this regex:
/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis
which behaves exactly like the first regex I presented above: https://regex101.com/r/yU8pV3/1
So, here is my question: what is wrong with that regex?
Thank you!
Solution
DISCLAIMER
Since the question has met with a positive reaction, I will post an answer explaining what is wrong with your approach, and will show how to match text that is not some specific text.
HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.
What is wrong with your regex
Your regex contains the <div((.+?(?=</div>)|.+?(?=<div>))|(?R))* part (the same as <div((.+?(?=</?div>))|(?R))*) before matching the closing </div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unrolled-loop structure, when you know what you are doing). What it does is this:
<div - matches <div literally (also in <diverse, due to a missing word boundary or \s after it)
( - Group 1, which matches:
(.+?(?=</div>)|.+?(?=<div>)) - any 1+ chars (as few as possible) up to the first </div> or up to the first <div>
| - or
(?R) - recurses the whole pattern (i.e. inserts and uses it)
)* - repeats Group 1 zero or more times.
The problem is clear: the (.+?(?=</?div>)) part does not exclude matching <div> or </div>. This branch MUST only match text that is NOT EQUAL to the leading and trailing delimiters.
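The flaw can be seen in isolation with a small sketch. It uses Python's standard `re` module (an assumption of the sketch; the question is about PCRE) and tests only the lazy-dot branch, without the surrounding recursion.

```python
import re

# The branch from the question, on its own: lazy dot plus a lookahead.
lazy_branch = re.compile(r'.+?(?=</?div>)', re.DOTALL)

sample = '<div>a</div>'
consumed = lazy_branch.match(sample).group(0)

# ".+?" must take at least one character, and nothing stops that character
# from being the "<" of a delimiter, so the opening tag itself is swallowed.
print(repr(consumed))
```

Because the branch can eat an opening `<div>` as ordinary text, the recursion never gets the chance to pair that tag with its closing `</div>`.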
Solution(s)
To match text other than some specific text, use a tempered greedy token:
<div\b[^<]*>((?:(?!</?div\b).)+|(?R))*</div>\s*
             ^^^^^^^^^^^^^^^^^^
See the regex demo. Note that you must use the DOTALL modifier so that the dot can match text across newlines. The capturing group is redundant; you can turn it into a non-capturing one.
What is important here is that (?:(?!</?div\b).)+ only matches 1 or more characters that are not the starting character of a <div....> or </div sequence. See my above linked thread on how that works.
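The tempered greedy token itself works in Python's standard `re` module, but `(?R)` does not, so the following is only a sketch under that constraint. It swaps the recursion for a loop that repeatedly removes the innermost divs (those whose content contains no other div tag) until nothing is left to remove. The names `INNERMOST_DIV` and `strip_divs` are made up for the illustration.

```python
import re

# A div whose body contains no "<div" or "</div": the tempered greedy token
# (?:(?!</?div\b).)* refuses to step onto either delimiter.
INNERMOST_DIV = re.compile(
    r'<div\b[^<]*>(?:(?!</?div\b).)*</div>\s*',
    re.DOTALL | re.IGNORECASE,
)


def strip_divs(text: str) -> str:
    """Remove all divs by peeling off one nesting level per pass."""
    while True:
        text, removed = INNERMOST_DIV.subn('', text)
        if removed == 0:
            return text


html = (
    'Text outside any div\n'
    '<div id="my_id"> bla bla\n'
    '<div>\n'
    'bla bla bla\n'
    '<div style="some style here">\n'
    'lalalalala\n'
    '</div>\n'
    '</div>\n'
    "I'm trapped in a div!\n"
    '</div>\n'
    'more text outside divs\n'
    '<div>more divs here!\n'
    '<div id="justbeingannoying">radiohead rules</div>\n'
    '</div>\n'
)

print(strip_divs(html))
```

On the sample from the question this leaves only the two lines outside the divs. With a PCRE engine the single recursive pattern does the same job in one pass.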
As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:
<div\b[^<]*>(?:[^<]+(?:<(?!/?div\b)[^<]*)*|(?R))*</div>\s*
See this regex demo.
Now, the token looks like [^<]+(?:<(?!/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then 0+ non-< characters.
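As a quick check of the unrolled token against the tempered one, the sketch below compares the two on their own, without the surrounding `<div ...>` and recursion. It uses Python's standard `re` module, which is an assumption of the sketch, and the sample strings are made up.

```python
import re

TEMPERED = re.compile(r'(?:(?!</?div\b).)+', re.DOTALL)
UNROLLED = re.compile(r'[^<]+(?:<(?!/?div\b)[^<]*)*')

# Content that starts with plain text: both tokens stop right before "</div>".
text = 'bla <b>bold</b> bla</div>'
tempered_match = TEMPERED.match(text).group(0)
unrolled_match = UNROLLED.match(text).group(0)
print(tempered_match == unrolled_match)

# Content that starts directly with a non-div tag: the unrolled token needs
# at least one non-"<" character first, so it does not match here, while the
# tempered token does.
starts_with_tag = '<b>x</b></div>'
print(TEMPERED.match(starts_with_tag).group(0))
print(UNROLLED.match(starts_with_tag))
```

The second case suggests that the unrolled form, as written, is not a drop-in replacement when a div's content begins with another tag, so it is worth testing against such input before relying on it.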
<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.
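For comparison, here is a minimal sketch of the parser-based route. It is not the PHP DOM approach the question mentions: it uses Python's standard `html.parser`, which is an event-based parser rather than a DOM, as a stand-in. The class name `DivStripper` and the function `strip_divs` are made up for the illustration, and the sketch simplifies things (character references in kept text are decoded, and self-closing tags are not specially handled).

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Collect everything that is not inside a <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # how many divs are currently open
        self.parts = []     # pieces of output outside all divs

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'div':
            if self.depth > 0:
                self.depth -= 1
        elif self.depth == 0:
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_divs(html: str) -> str:
    parser = DivStripper()
    parser.feed(html)
    parser.close()          # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_divs('a<div id="x">b<div>c</div>d</div>e'))
```

The parser takes care of tag names, attributes and case, so names such as `<div-tmp>` or `<diverse>` are never confused with `<div>`.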