Final solution for using regex to remove html nested tags of the same type?

Views: 111 Posted: 2017/6/25 5:10:35 Tags: php html regex dom nested


Question

I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something similar, but let's treat this as a theoretical question). After looking up answers, I finally came up with what I'll show near the end of this question.

What follows is just a summary of a lot of things I've tried before.

First of all, what I mean by nested tags of the same type is:
Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
     </div>
   </div>
    I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
       <div id="justbeingannoying">radiohead rules</div>
</div>
Now imagine I want to remove all the divs and their content using regex. So the intended result would be:
Text outside any div
more text outside divs
The first idea would be to match everything. The following regex matches div tags with attributes (style, id, etc.):
/<div[^>]*>.*<\/div>/sig
The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.
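A minimal PHP sketch of that over-matching, assuming a short made-up input (`$html` and `$greedy` are names invented for this illustration). The `g` flag is left out because it is regex101 notation; `preg_replace` already replaces every match and PHP does not accept `g` as a modifier.

```php
<?php
// Illustrative input only: two top-level div blocks with text between them.
$html = 'before <div>a <div>b</div> c</div> middle <div>d</div> after';

// Greedy .* runs to the LAST </div> in the string.
$greedy = preg_replace('/<div[^>]*>.*<\/div>/si', '', $html);

// "middle" sits between the two blocks, yet it is removed as well.
echo $greedy, PHP_EOL; // before  after
```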

This could be solved using the U modifier (Ungreedy) 
/<div[^>]*>.*<\/div>/sigU
but then we'll have the opposite problem of matching less than we want: it will match only from the first "<div" to the first "</div>" (so, if we remove the matches, besides some unmatched tags we'll be left with the text "I'm trapped in a div!", which we don't want).
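The under-matching can be reproduced the same way. This is only a sketch on a made-up input (`$html` and `$lazy` are illustrative names), again without the regex101-only `g` flag.

```php
<?php
// Illustrative input only: a nested block followed by a flat one.
$html = 'before <div>a <div>b</div> c</div> middle <div>d</div> after';

// U (ungreedy) makes .* stop at the FIRST </div> it can reach.
$lazy = preg_replace('/<div[^>]*>.*<\/div>/siU', '', $html);

// The outer block is cut in half: " c</div>" is left behind.
echo $lazy, PHP_EOL; // before  c</div> middle  after
```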

So, I found a solution that works like a charm for nested parentheses, square brackets, etc:
/\[([^\[\]]*+|(?R))*\]/si
Basically, what this does is find an opening square bracket, then match anything that is neither an opening nor a closing square bracket OR a recursive structure of the same kind, and finally find a closing square bracket.
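For reference, a small sketch of that recursive bracket pattern in PHP. The input string and variable names are made up for the example.

```php
<?php
// Illustrative input: one nested bracket group and one flat group.
$text = 'keep [a [b [c] d] e] this [x] too';

// \[ ... \] with either non-bracket runs (possessive) or a recursion of the
// whole pattern, (?R), in between.
$result = preg_replace('/\[([^\[\]]*+|(?R))*\]/s', '', $text);

echo $result, PHP_EOL; // keep  this  too
```

This works because a single character class, `[^\[\]]`, is enough to say "not a delimiter" when each delimiter is one character long.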

What I have working now is a bad solution: basically, first I replace all the opening tags with an opening square bracket (a character which can't appear in my code, for other reasons), then the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.
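The workaround described above could look roughly like this. It is a sketch that assumes, as stated, that the input never contains a literal "[" or "]"; the two tag-matching patterns and all variable names are guesses made for illustration, not the asker's actual code.

```php
<?php
// Assumption: no literal "[" or "]" occurs anywhere in the input.
$html = 'before <div id="a">x <div>y</div> z</div> after';

// Step 1: opening tags become "[", closing tags become "]".
$tmp = preg_replace('/<div\b[^>]*>/i', '[', $html);
$tmp = preg_replace('/<\/div>/i', ']', $tmp);
// $tmp is now: before [x [y] z] after

// Step 2: strip balanced bracket groups with the recursive pattern.
$out = preg_replace('/\[([^\[\]]*+|(?R))*\]/s', '', $tmp);

echo $out, PHP_EOL; // before  after
```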

The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the html tags has to work.
But it is not that easy. The problem is that the negation for characters ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved with this:
.+?(?=<div>)
and, of course, the same for the closing tag
.+?(?=<\/div>)
This is how, more or less, I arrived at this regex:
/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis
which behaves exactly like the first regex I presented before: https://regex101.com/r/yU8pV3/1

So, here is my question: what is wrong with that regex? 

Thank you!
Answer
DISCLAIMER

Since the question has met with a positive reaction, I will post an answer explaining what is wrong with your approach, and will show how to match text that is not some specific text.

HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

What is wrong with your regex

Your regex contains the <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* part (equivalent to <div((.+?(?=<\/?div>))|(?R))*) before the closing <\/div> is matched. When you have delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unrolled-loop structure and you know what you are doing). Here is what it does:


<div - matches <div literally (and also inside <diverse, because there is no word boundary or \s after it)
( - Group 1, which matches either:
    (.+?(?=<\/div>)|.+?(?=<div>)) - any 1+ chars (as few as possible) up to the first </div> or up to the first <div>
    | - or
    (?R) - a recursion of the whole pattern (i.e. the entire regex is inserted and used at this point)
)* - repeats Group 1 zero or more times.


The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div>. The lookahead only says where the branch must stop, not what it may consume on the way. This branch MUST only match text that is NOT one of the leading or trailing delimiters.
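A short PHP sketch of that failure mode, on a made-up input (all variable names are illustrative). The first call isolates the lazy-dot branch; the second runs the question's full pattern without the regex101-only `g` flag.

```php
<?php
// The branch in isolation: started right at a closing tag, it consumes the tag
// and only stops in front of the NEXT delimiter.
preg_match('/.+?(?=<\/?div>)/s', '</div> c</div>', $m);
echo $m[0], PHP_EOL; // </div> c

// Consequence for the whole pattern: one match from the first <div to the last </div>.
$html = 'before <div>a <div>b</div> c</div> middle <div>d</div> after';
$result = preg_replace('/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/is', '', $html);
echo $result, PHP_EOL; // before  after
```

Because the dot branch can always step over a delimiter, the recursion branch is never needed, which is why the result is the same as with the plain greedy pattern.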

Solution(s)

To match text other than some specific text use a tempered greedy token.
<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^ 
See the regex demo. Note that you must use a DOTALL modifier (s) to be able to match text across newlines. The capturing group is redundant; you can remove it or make it non-capturing.

What is important here is that (?:(?!<\/?div\b).)+ only matches 1 or more characters, none of which is the starting character of a <div....> or </div sequence. See my above linked thread on how that works.
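Applied in PHP to the sample from the question, the tempered greedy token pattern might be used as below. This is a sketch: the variable names are illustrative, the `~` delimiter is chosen only for readability, and the redundant capturing group is written as a non-capturing one.

```php
<?php
// Sample input from the question (indentation slightly normalised).
$html = <<<'HTML'
Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
    </div>
  </div>
  I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
  <div id="justbeingannoying">radiohead rules</div>
</div>
HTML;

// i = case-insensitive, s = DOTALL so that "." also matches newlines.
$pattern = '~<div\b[^<]*>(?:(?:(?!<\/?div\b).)+|(?R))*<\/div>\s*~is';

$clean = preg_replace($pattern, '', $html);
echo $clean; // only the two lines of text outside the divs remain
```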

As for performance, tempered greedy tokens are resource-consuming, because the lookahead is run before every single character. The unroll-the-loop technique comes to the rescue:
<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*
See this regex demo

Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then again 0+ non-< characters.
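A quick PHP check of the unrolled pattern, with one caveat that seems worth testing on your own data: as written, the token has to start with at least one non-`<` character, so a div whose content begins directly with another tag (for example `<div><b>...`) does not match. The `$variant` pattern below is an assumption added for this sketch, not part of the original answer; it accepts a lone `<` as its own alternative. The inputs and variable names are made up.

```php
<?php
// Pattern from the answer (unrolled loop).
$unrolled = '~<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*~i';

// Hypothetical variant: runs of non-"<" characters, OR a single "<" that does
// not start a div tag, OR a recursion.
$variant = '~<div\b[^<]*>(?:[^<]++|<(?!\/?div\b)|(?R))*<\/div>\s*~i';

$textFirst = 'a <div>x <b>y</b> <div>z</div> w</div> b'; // content starts with text
$tagFirst  = 'a <div><b>y</b></div> b';                  // content starts with a tag

$r1 = preg_replace($unrolled, '', $textFirst);
$r2 = preg_replace($unrolled, '', $tagFirst);
$r3 = preg_replace($variant, '', $tagFirst);

echo $r1, PHP_EOL; // a b
echo $r2, PHP_EOL; // a <div><b>y</b></div> b   (no match, input unchanged)
echo $r3, PHP_EOL; // a b
```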

<div\b might still match in <div-tmp, so perhaps, <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.
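Both closing remarks can be sketched in PHP. The first part checks the word-boundary issue; the lookahead form `<div(?=[\s>])` is an assumption added here because it tests the next character without consuming it, so the rest of the pattern can stay unchanged. The second part is one possible DOM-based removal, assuming the DOM extension is available; the input, the XPath expression and the variable names are all illustrative.

```php
<?php
// "-" is a non-word character, so \b is satisfied between "v" and "-".
$loose  = preg_match('/<div\b/', '<div-tmp>');          // 1: unwanted match
$strict = preg_match('/<div(?=[\s>])/', '<div-tmp>');   // 0: rejected
$plain  = preg_match('/<div(?=[\s>])/', '<div id="x">'); // 1: real div still matches

// DOM alternative: let a parser find the elements, then remove them.
$html = "Text outside any div\n<div id=\"a\">x <div>y</div> I'm trapped in a div!</div>\nmore text outside divs";

libxml_use_internal_errors(true); // keep parser warnings about fragments quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Only outermost divs are selected; their nested divs go away with them.
foreach (iterator_to_array($xpath->query('//div[not(ancestor::div)]')) as $div) {
    $div->parentNode->removeChild($div);
}

$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text, PHP_EOL;
```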
