Final solution for using regex to remove html nested tags of the same type?


Problem description


I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something similar, but let's take this as a theoretical question), looking up answers, and I finally came up with what I'll show near the end of this question.

What follows is just a summary of a lot of things I've tried before.

First of all, what I mean by nested tags of the same type is:

Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
     </div>
   </div>
    I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
       <div id="justbeingannoying">radiohead rules</div>
</div>

Now imagine I want to remove all the divs and their content using regex. So the intended result would be:

Text outside any div
more text outside divs

The first idea would be to match everything. The following regex matches div tags with attributes (style, id, etc.):

/<div[^>]*>.*<\/div>/sig

The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.

This could be solved using the U modifier (Ungreedy):

/<div[^>]*>.*<\/div>/sigU

but then we'll have the problem of matching less than we want: it will match only from the first "<div" to the first "</div>" (so, if we remove the matches, besides some unmatched tags, we'll be left with the text "I'm trapped in a div!", which we don't want).

So, I found a solution that works like a charm for nested parentheses, square brackets, etc.:
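The two failure modes described above can be reproduced with a short sketch. It uses Python's standard `re` module rather than PCRE, so the U modifier is expressed as the lazy quantifier `.*?`, and `SAMPLE` is a condensed version of the markup from the question; both choices are assumptions made for illustration.

```python
import re

# Condensed version of the sample markup from the question (illustrative).
SAMPLE = (
    "Text outside any div\n"
    '<div id="my_id"> bla bla\n'
    "<div>bla bla bla\n"
    '<div style="some style here">lalalalala</div>\n'
    "</div>\n"
    "I'm trapped in a div!\n"
    "</div>\n"
    "more text outside divs\n"
    "<div>more divs here!\n"
    '<div id="justbeingannoying">radiohead rules</div>\n'
    "</div>\n"
)

# Greedy: runs from the first "<div" to the very last "</div>".
greedy = re.compile(r"<div[^>]*>.*</div>", re.S | re.I)
# Lazy: stops at the first "</div>" after each opening tag.
lazy = re.compile(r"<div[^>]*>.*?</div>", re.S | re.I)

greedy_result = greedy.sub("", SAMPLE)
lazy_result = lazy.sub("", SAMPLE)

print(repr(greedy_result))  # "more text outside divs" has been swallowed
print(repr(lazy_result))    # stray "</div>" tags and trapped text remain
```

The greedy version deletes too much and the lazy version too little, which is why a plain dot cannot balance nested tags in either mode.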

/\[([^\[\]]*+|(?R))*\]/si

Basically, what this does is find an opening square bracket, then match anything that is neither an opening nor a closing square bracket OR a recursive structure of that, and finally find a closing square bracket.

What I have working now is a bad solution: basically, first I replace all the opening tags with an opening square bracket (which can't appear in my code, for other reasons), then the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.

The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the html tags has to work. But it's not that easy. The problem is that negation for characters ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved with this:
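For readers without a recursive regex engine, the bracket-balancing idea can be approximated differently. Python's standard `re` module has no `(?R)` construct, so the sketch below swaps in another technique: it repeatedly deletes innermost bracket pairs until nothing changes. The helper name `strip_brackets` is hypothetical, and this is not the recursive pattern from the question.

```python
import re

# An innermost pair: "[" ... "]" with no other bracket in between.
INNERMOST = re.compile(r"\[[^\[\]]*\]")


def strip_brackets(text: str) -> str:
    """Remove bracketed groups, innermost first, until none are left."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text


print(strip_brackets("a[b[c]d]e[f]g"))  # aeg
```

Each pass peels off one nesting level, so the number of passes equals the maximum nesting depth. Unbalanced brackets are simply left in place.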

.+?(?=<div>)

and, of course, the same for the closing tag

.+?(?=<\/div>)

This is, more or less, how I arrived at this regex:

/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis

Which behaves exactly like the first regex I presented before: https://regex101.com/r/yU8pV3/1

So, here is my question: what is wrong with that regex?

Thank you!

Answer

DISCLAIMER

Since the question has met with a positive reaction, I will post an answer explaining what is wrong with your approach, and will show how to match text that is not some specific text.

HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

What is wrong with your regex

Your regex contains the <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* part (same as <div((.+?(?=<\/?div>))|(?R))*) before matching the closing <\/div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unroll-the-loop structure, when you know what you are doing). What it does is this:

  • <div - match <div literally (this also matches in <diverse, because there is no word boundary or \s after it)
  • ( - Group 1 that matches:
    • (.+?(?=<\/div>)|.+?(?=<div>)) - matches either any 1+ chars (as few as possible) up to the first </div> or to the first <div>
    • |
    • (?R) - recurse the whole pattern (i.e. insert and use it again at this point)
  • )* - repeat Group 1 zero or more times.

The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div>. This branch MUST match only text that is NOT EQUAL to the leading and trailing delimiters.
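The flaw in that branch can be observed in isolation, without recursion. The following is a minimal sketch in Python's standard `re` module; the input string is made up for illustration. Because `.+?` must consume at least one character before the lookahead is tested, it can start on a delimiter and run straight over it.

```python
import re

# The "plain text" branch from the question, taken on its own.
branch = re.compile(r".+?(?=</?div>)", re.S)

# Start matching exactly where an opening tag begins.
m = branch.match("<div>inner</div> tail </div>")
print(m.group(0))  # <div>inner
```

The branch swallows the `<div>` delimiter itself, so inside the repeated group it can consume opening tags that the recursive branch was supposed to balance.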

Solution(s)

To match text other than some specific text, use a tempered greedy token.

<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^ 

See the regex demo. Note that you must use a DOTALL modifier to be able to match text across newlines. The capturing group is redundant; you can remove it.

What is important here is that (?:(?!<\/?div\b).)+ only matches 1 or more characters that are not the starting character of a <div....> or </div sequence. See my above linked thread on how that works.

As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:
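The tempered greedy token itself needs no recursion, so it can be tried in Python's standard `re` module, which lacks `(?R)`. The sketch below is an adaptation under stated assumptions, not the recursive pattern above: it matches only innermost divs (the token uses `*` so empty divs match too) and repeats the substitution until nothing is left. `strip_divs` and `SAMPLE` are hypothetical names.

```python
import re

# An innermost div: opening tag, a body that never steps onto the start
# of "<div" or "</div", then the closing tag and trailing whitespace.
INNERMOST_DIV = re.compile(
    r"<div\b[^<]*>(?:(?!</?div\b).)*</div>\s*",
    re.S | re.I,  # re.S is Python's DOTALL
)


def strip_divs(html: str) -> str:
    """Delete divs from the inside out until no complete div remains."""
    while True:
        html, count = INNERMOST_DIV.subn("", html)
        if count == 0:
            return html


SAMPLE = (
    "Text outside any div\n"
    '<div id="my_id"> bla bla\n'
    "<div>bla bla bla\n"
    '<div style="some style here">lalalalala</div>\n'
    "</div>\n"
    "I'm trapped in a div!\n"
    "</div>\n"
    "more text outside divs\n"
)

print(strip_divs(SAMPLE))
```

On this input the loop needs three passes, one per nesting level, and leaves only the two lines of text outside the divs.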

<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*

See this regex demo

Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then again 0+ non-< characters.
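The unrolled token can be exercised the same way. This sketch again assumes Python's standard `re` module and the innermost-first loop in place of `(?R)`. One deliberate adaptation: the body starts with `[^<]*` rather than `[^<]+`, so that a div whose content begins directly with a non-div tag (for example `<div><b>x</b></div>`) is still matched. `strip_divs_unrolled` is a hypothetical name.

```python
import re

UNROLLED_DIV = re.compile(
    r"<div\b[^<]*>"                 # opening tag
    r"[^<]*(?:<(?!/?div\b)[^<]*)*"  # non-< runs, plus any < not starting a div tag
    r"</div>\s*",                   # closing tag and trailing whitespace
    re.I,
)


def strip_divs_unrolled(html: str) -> str:
    """Delete divs from the inside out using the unrolled body."""
    while True:
        html, count = UNROLLED_DIV.subn("", html)
        if count == 0:
            return html


print(strip_divs_unrolled("keep <div>a <b>bold</b> <div>inner</div> z</div>tail"))
# keep tail
```

No DOTALL flag is needed here, because the negated class `[^<]` already matches newlines.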

<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.
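The difference between the two opening-tag patterns is easy to check directly. The sketch below uses Python's standard `re` module; the tag strings are made-up test inputs.

```python
import re

word_boundary = re.compile(r"<div\b")
explicit_end = re.compile(r"<div(?:\s|>)")

candidates = ["<div>", '<div id="x">', "<diverse>", "<div-tmp>"]

# Map each tag to (matches <div\b, matches <div(?:\s|>)).
results = {
    tag: (bool(word_boundary.match(tag)), bool(explicit_end.match(tag)))
    for tag in candidates
}

for tag, (wb, ee) in results.items():
    print(f"{tag:15} \\b: {wb!s:5}  (?:\\s|>): {ee}")
```

Both patterns reject `<diverse>`, but only `<div(?:\s|>)` also rejects `<div-tmp>`, because a hyphen still counts as a word boundary after `div`.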
