Regular expression search: avoiding nested results
Problem description
My document contains several instances of code blocks that look like this:
{% highlight %}
//some code
{% endhighlight %}
In Atom.io, I am trying to write a regex search to capture those.
My first try was:
{% highlight .* %}([\S\s]+){% endhighlight %}
The problem is that, because there are several code blocks in the same document, it matches everything from the start of the first code block to the end of the last one, all in one match.
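The greedy behaviour described above can be reproduced outside the editor. The following is a minimal sketch in Python's re module (not Atom's own regex engine); the sample document and its "js" language tag are made up for illustration.

```python
import re

# Hypothetical sample document with two highlight blocks.
# The "js" language tag is an assumption so that the " .* " part has something to match.
doc = (
    "{% highlight js %}\n"
    "var a = 1;\n"
    "{% endhighlight %}\n"
    "Some prose.\n"
    "{% highlight js %}\n"
    "var b = 2;\n"
    "{% endhighlight %}\n"
)

# The first attempt from the question: [\S\s]+ is greedy, so it runs to the
# LAST closing tag in the document instead of stopping at the first one.
greedy = re.compile(r"{% highlight .* %}([\S\s]+){% endhighlight %}")

matches = greedy.findall(doc)
print(len(matches))               # number of matches found
print("Some prose." in matches[0])  # whether text between the blocks was swallowed
```

Both code blocks and the prose between them end up inside a single capture, which is the unwanted result.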
I thought of excluding the { character:
{% highlight .* %}([^\{]+){% endhighlight %}
But the problem is that some of the code blocks contain valid { characters (such as function(){ ... }).
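A small check of that limitation, again as a sketch in Python's re module with invented sample strings: the negated character class matches a block without braces but fails as soon as the code itself contains a {.

```python
import re

# Second attempt from the question: the body may not contain any "{".
no_brace = re.compile(r"{% highlight .* %}([^\{]+){% endhighlight %}")

# Hypothetical sample blocks (the "js" tag is an assumption).
plain = "{% highlight js %}\nvar a = 1;\n{% endhighlight %}"
with_braces = "{% highlight js %}\nfunction(){ return 1; }\n{% endhighlight %}"

plain_found = no_brace.search(plain) is not None
braces_found = no_brace.search(with_braces) is not None

print(plain_found)   # block without braces
print(braces_found)  # block whose code contains "{"
```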
Recommended answer
The problem with Karthik's lazy matching solution is that when there are large substrings between {% highlight %} and {% endhighlight %}, the [\s\S]*? has to keep more and more text in the backtracking buffer, which can eventually overrun.
Using an unrolling-the-loop technique, you can avoid that:
{% highlight %}([^{]*(?:{(?!% endhighlight %})[^{]*)*){% endhighlight %}
See the regex demo.
This way, the substrings inside the highlight blocks can be of any length and performance will stay fast.
The main parts of the regex:
- {% highlight %} - matches the {% highlight %} text literally
- ([^{]*(?:{(?!% endhighlight %})[^{]*)*) - matches everything up to, but not including, {% endhighlight %} and captures it into group 1:
  - [^{]* - 0 or more characters other than {
  - (?:{(?!% endhighlight %})[^{]*)* - 0 or more sequences of:
    - {(?!% endhighlight %}) - a literal { not followed by % endhighlight %}
    - [^{]* - 0 or more characters other than {
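The breakdown above can be tried on a small input. This is a minimal sketch using Python's re module rather than Atom's search box; the sample document is invented, and the pattern is copied from the answer unchanged.

```python
import re

# Unrolled pattern from the answer, used as-is.
unrolled = re.compile(
    r"{% highlight %}([^{]*(?:{(?!% endhighlight %})[^{]*)*){% endhighlight %}"
)

# Hypothetical document: the first block contains braces, the second does not.
doc = (
    "{% highlight %}\n"
    "function(){ return 1; }\n"
    "{% endhighlight %}\n"
    "Some prose.\n"
    "{% highlight %}\n"
    "var b = 2;\n"
    "{% endhighlight %}\n"
)

# findall returns the contents of group 1 for every non-overlapping match.
blocks = unrolled.findall(doc)
for block in blocks:
    print(repr(block))
```

Each block is captured separately, and the { inside function(){ ... } no longer stops the match, because only a { that starts the closing tag is rejected by the lookahead.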
This is basically the same as
{% highlight %}([\s\S]*?){% endhighlight %}
, but "unrolled". The linear execution ensures a safer and faster user experience.
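The claim that the two patterns match the same text can be spot-checked. The sketch below, in Python's re module with a made-up document, only compares the captured results; it does not measure speed or memory, which depend on the regex engine in use.

```python
import re

# Lazy version and unrolled version, both taken from the answer.
lazy = re.compile(r"{% highlight %}([\s\S]*?){% endhighlight %}")
unrolled = re.compile(
    r"{% highlight %}([^{]*(?:{(?!% endhighlight %})[^{]*)*){% endhighlight %}"
)

# Hypothetical document: one short block with braces and one long block.
doc = (
    "{% highlight %}\nif (x) { y(); }\n{% endhighlight %}\n"
    "Text between blocks.\n"
    "{% highlight %}\n" + "var n = 0;\n" * 1000 + "{% endhighlight %}\n"
)

lazy_blocks = lazy.findall(doc)
unrolled_blocks = unrolled.findall(doc)

print(len(lazy_blocks), len(unrolled_blocks))
print(lazy_blocks == unrolled_blocks)
```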