PHP: How do I remove nested tags, and relocate them in an un-nested way?

Question

I need to remove all occurrences of a BBCode-style tag from a string. The tags can be nested, and this is where I am failing. I also need to relocate each tag and its contents to the end of the string, and replace the tag with an HTML element. I have tried regex and preg_replace_callback, but so far without success. I also tried to adapt the answers to "Removing nested bbcode (quotes) in PHP" and "How can I remove an html element and it's contents using RegEx", with no luck. I don't think I can use an HTML parser for this, because the HTML is malformed (children in elements that can't have children).

This is what the string looks like:

This is some 
[tag] attribute=1 attribute2=1 
     [tag] attribute=1 attribute2=1 [/tag] 
     [tag] attribute=1 attribute2=1 [/tag]
[/tag]
 text.

The result should look like this:

This is some text.
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>

Any help would be greatly appreciated.

Recommended answer

Street cred: I worked for Infopop (later known as Groupee, now Social Strata), the creators of UBBCode, the thing that was copied and transformed into just plain old regular "BBCode."

tl;dr: Time to write your own non-regex parser.

Most BBCode parsers use regexes, and that works for most cases, but you're doing something custom here. Plain old regular expressions are not going to help you. Regex quantifiers have two modes of operation, and both get in our way: we can match everything between two tags in "greedy" mode, or in "non-greedy" mode.

In "greedy" mode, we'll capture everything between the very first opening tag and the very last closing tag. This breaks things horribly. Take this case:

[a][b][c]...[/c][/b][/a]...[a]...[/a]

A greedy regex like \[a\].+\[/a\] is going to grab everything from that first opening tag to that last closing tag, ignoring the fact that the closer isn't closing the opener.
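
To see the greedy behavior concretely, here is a minimal sketch that runs that pattern against the sample string above. The variable names are only for illustration, and the `s` modifier is an addition so that `.` would also cross line breaks in multi-line input.

```php
<?php
// Sample from the answer: a nested [a] block followed by an unrelated [a] block.
$input = '[a][b][c]...[/c][/b][/a]...[a]...[/a]';

// Greedy: .+ grabs as much as it can, so the match runs to the LAST [/a].
preg_match('~\[a\].+\[/a\]~s', $input, $greedy);

// Prints the whole string: both [a] blocks were swallowed as one match.
echo $greedy[0], PHP_EOL;
```

The single match covers two separate [a] blocks, so anything done to "the contents" would also clobber the text between them.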

The other option is worse. Take this case:

[a][b][a]...[/a][/b][/a]

A non-greedy regex like \[a\].+?\[/a\] (the only change is the question mark) is going to match the first opening tag, but then it'll stop at the first closing tag, again ignoring that this closing tag doesn't belong to that opening tag.
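
The same kind of quick check for the non-greedy pattern, again only as an illustrative sketch with made-up variable names:

```php
<?php
// Sample from the answer: an [a] nested inside another [a].
$input = '[a][b][a]...[/a][/b][/a]';

// Non-greedy: .+? stops at the FIRST [/a] it can reach.
preg_match('~\[a\].+?\[/a\]~s', $input, $lazy);

// Prints: [a][b][a]...[/a]
echo $lazy[0], PHP_EOL;
```

The outer opener has been paired with the inner closer, and the trailing [/b][/a] is left dangling.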

The way I solved this, way, way back in the primitive days, was to completely ignore the fact that the opening and closing tags didn't match. I simply looped the entire chain of tag transformation regexes until the output stopped changing. It was simple and effective, mainly because the available tag set was intentionally limited, so nesting was never an issue.
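
For illustration, that loop-until-stable idea might look roughly like the sketch below. This is not the original UBBCode code; the function name `convertUntilStable`, the `$maxPasses` safety cap, and the two-rule tag set are all assumptions made for the example.

```php
<?php
/**
 * Apply every rule repeatedly until a pass changes nothing.
 * $rules maps regex pattern => replacement; $maxPasses is a safety cap.
 */
function convertUntilStable(string $text, array $rules, int $maxPasses = 50): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $next = preg_replace(array_keys($rules), array_values($rules), $text);
        if ($next === null || $next === $text) {
            break; // regex error, or the output has stopped changing
        }
        $text = $next;
    }
    return $text;
}

// Hypothetical, deliberately tiny tag set.
$rules = [
    '~\[b\](.*?)\[/b\]~s' => '<b>$1</b>',
    '~\[i\](.*?)\[/i\]~s' => '<i>$1</i>',
];

// Prints: <b>bold <i>both</i></b>
echo convertUntilStable('[b]bold [i]both[/i][/b]', $rules), PHP_EOL;
```

Note that the rules never check which closer belongs to which opener. The output only looks right because every opener and closer eventually gets swapped for its HTML counterpart.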

The instant you allow nesting of identical tags, blind, brute force is no longer a suitable tool.

If none of the BBCode parsing engines out there are going to work for you, you might have to write your own. Check all of them out. There are some on PEAR, there's a PECL extension, etc. Also check other languages for inspiration: Perl's CPAN has a dozen different implementations, some of which are very powerful and complex (if there isn't a proper recursive descent parser in that mix, I'll be depressed). This is a good challenge, but it's not too hard. Then again, I've written like five now (none of which I can release), so maybe I'm biased?

Start by exploding the string on [ and ]. Go through the resulting array, keeping track of when the array index following the opening bracket and before the next closing bracket happens to look like a valid tag and/or attributes. You're going to need to think about what happens when an attribute can contain a bracket, or worse, when URLs are bracket-heavy (like PHP array syntax). You'll also need to think about attributes in general, including how (if?) they are quoted, whether multiple attributes per tag are allowed (as in your example), and what to do with invalid attributes.
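
A minimal sketch of that splitting step, under some loud assumptions: the function name `tokenize`, the token array shape, and the `$knownTags` whitelist are all invented for this example, and attributes are not handled at all. It uses preg_split with PREG_SPLIT_DELIM_CAPTURE so the brackets themselves stay in the array.

```php
<?php
/**
 * Split on "[" and "]", then decide piece by piece whether "[...]" is a tag.
 * $knownTags is an assumed whitelist; anything else stays literal text.
 */
function tokenize(string $input, array $knownTags): array
{
    // Keep the brackets as their own array elements.
    $parts = preg_split('~([\[\]])~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $count = count($parts);
    $tokens = [];

    for ($i = 0; $i < $count; $i++) {
        $bracketed = $parts[$i] === '[' && $i + 2 < $count && $parts[$i + 2] === ']';

        if ($bracketed
            && preg_match('~^(/?)(\w+|\*)$~', $parts[$i + 1], $m)
            && in_array(strtolower($m[2]), $knownTags, true)
        ) {
            $tokens[] = [
                'type' => $m[1] === '/' ? 'close' : 'open',
                'name' => strtolower($m[2]),
            ];
            $i += 2; // skip the tag name and the closing bracket
            continue;
        }

        if ($parts[$i] === '') {
            continue;
        }

        // Not a tag: merge into the previous text token, or start a new one.
        $last = count($tokens) - 1;
        if ($last >= 0 && $tokens[$last]['type'] === 'text') {
            $tokens[$last]['value'] .= $parts[$i];
        } else {
            $tokens[] = ['type' => 'text', 'value' => $parts[$i]];
        }
    }

    return $tokens;
}

// "[1]" is not a known tag, so it survives as part of the text "y[1]z".
print_r(tokenize('x[b]y[1]z[/b]', ['b']));
```

Bracket-heavy text such as array syntax passes through untouched here only because it fails the whitelist check. Attributes inside the brackets would need their own parsing step.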

As you continue to process the string, you will also need to keep track of what tags are open, and in what order. You'll have to think about what tags are permitted inside other tags. You'll also have to deal with mis-nesting, like [a][b][/a][/b]. Your options will be either re-opening the inner tag after the outer closes, or closing the inner as soon as the outer does. Worse, different behavior might make sense depending on the situation. Worse-worse are wacky tags like [*] inside [list], which traditionally don't have a closing tag!
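
One way to track open tags is a plain stack. The sketch below implements only the second option (close the inner tag as soon as the outer one closes), drops stray closers, and closes anything still open at the end. The function name `rebalance` and the token shape are assumptions for the example, and it ignores rules about which tags may contain which.

```php
<?php
/**
 * Rebalance a token list so every opener has a properly nested closer.
 * Tokens are assumed to look like ['type' => 'open'|'close'|'text', 'name' => ...].
 */
function rebalance(array $tokens): array
{
    $stack = []; // names of currently open tags, innermost last
    $out = [];

    foreach ($tokens as $token) {
        if ($token['type'] === 'open') {
            $stack[] = $token['name'];
            $out[] = $token;
        } elseif ($token['type'] === 'close') {
            if (!in_array($token['name'], $stack, true)) {
                continue; // stray closer with no matching opener: drop it
            }
            // Close inner tags first, until the matching opener is popped.
            do {
                $name = array_pop($stack);
                $out[] = ['type' => 'close', 'name' => $name];
            } while ($name !== $token['name']);
        } else {
            $out[] = $token; // text passes through unchanged
        }
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out[] = ['type' => 'close', 'name' => array_pop($stack)];
    }

    return $out;
}

// The mis-nested case from the answer: [a][b][/a][/b]
$tokens = [
    ['type' => 'open', 'name' => 'a'],
    ['type' => 'open', 'name' => 'b'],
    ['type' => 'close', 'name' => 'a'],
    ['type' => 'close', 'name' => 'b'],
];

$names = array_map(
    fn(array $t) => ($t['type'] === 'close' ? '/' : '') . $t['name'],
    rebalance($tokens)
);
$rendered = '[' . implode('][', $names) . ']';

// Prints: [a][b][/b][/a]
echo $rendered, PHP_EOL;
```

Re-opening the inner tag after the outer closes would be a small change at the point where the inner closers are emitted: remember the popped names and push them back as new openers.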

Once you've processed the string and have created a list of open and closing tags (and possibly re-balanced the opens and closes), then you can transform the result into HTML, or whatever your output ends up being. This is when and how you'd move the output of those specific tags to the end of the new document.
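
Putting the pieces together for the exact case in the question, a rough sketch could look like this. It is deliberately narrow: it only knows the literal tag name `tag`, it assumes the text right after an opener (up to the next tag) is that tag's attribute string, and it drops any other text found inside a tag. The function name `relocateTags` is made up for the example.

```php
<?php
/**
 * Pull every [tag]...[/tag] out of $input, however deeply nested,
 * and append one <br ...> per opener after the remaining text.
 */
function relocateTags(string $input): string
{
    // Text and tags alternate in the result: text, tag, text, tag, ...
    $parts = preg_split('~(\[/?tag\])~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0;               // how many [tag] blocks are currently open
    $outside = '';            // text that is not inside any tag
    $relocated = [];          // the <br> elements to append at the end
    $expectAttributes = false;

    foreach ($parts as $part) {
        if ($part === '[tag]') {
            $depth++;
            $expectAttributes = true;
        } elseif ($part === '[/tag]') {
            $depth = max(0, $depth - 1); // ignore stray closers
        } elseif ($expectAttributes) {
            // Assumption: this text is the attribute list of the opener.
            // Escaping stops ">" from breaking out of the element, but it
            // also mangles quoted values and is NOT a whitelist of safe
            // attributes; real code needs proper attribute parsing.
            $attrs = trim($part);
            $relocated[] = $attrs === '' ? '<br>' : '<br ' . htmlspecialchars($attrs) . '>';
            $expectAttributes = false;
        } elseif ($depth === 0) {
            $outside .= $part;
        }
        // Any other text inside a tag is dropped in this sketch.
    }

    // Collapse the whitespace left behind where the tags were removed.
    $text = trim(preg_replace('~\s+~', ' ', $outside));

    return implode("\n", array_merge([$text], $relocated));
}

$input = "This is some \n"
    . "[tag] attribute=1 attribute2=1 \n"
    . "     [tag] attribute=1 attribute2=1 [/tag] \n"
    . "     [tag] attribute=1 attribute2=1 [/tag]\n"
    . "[/tag]\n"
    . " text.";

echo relocateTags($input), PHP_EOL;
```

For the sample input this prints the four lines the question asks for. A depth counter is enough here because there is only one tag name; with several tag types you would want the open-tag stack described above.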

Once you've finished up, write a thousand test cases. Try to break it, blow it into itty bitty chunks, produce XSS vulnerabilities, and otherwise do your best to make your life hell. It will be worth it, because the result will be a BBCode engine that will do what you're trying to do.
