Regex Pattern to Match, Excluding when... / Except between


Question


--Edit-- The current answers have some useful ideas, but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also, ideas that work everywhere are better for me than non-standard syntax like \K

This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.

Example

I want to match five digits using \b\d{5}\b but not in three situations s1 s2 s3:

s1: Not on a line that ends with a period like this sentence.

s2: Not anywhere inside parens.

s3: Not inside a block that starts with if( and ends with //endif

I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially with C# lookbehind or \K in PHP.

For instance

s1 (?m)(?!\d+.*?\.$)\d+

s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b

s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+

But the mix of conditions together makes my head explode. Even worse, I may need to add other conditions s4 s5 at another time.

The good news is, I don't care if I process the files using any of the most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if they offer a solution.

So I came here to see if someone can think of a flexible recipe.

Hints are okay: you don't need to give me full code. :)

Thank you.

Solution

Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.

First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.

Surprise

Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...

Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.

A Better-Known Variation

There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).

Thanks for all the background, zx81... But what's the recipe?

Key Fact

The method returns the match in Group 1 capture. It does not care at all about the overall match.

In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.

The general recipe is

Not_this_context|Not_this_either|StayAway|(WhatYouWant)

This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.

In your case, with your digits and your three contexts to ignore, we can do:

s1|s2|s3|(\b\d+\b)

Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a | )

The whole expression can be written like this:

(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)

See this demo (but focus on the capture groups in the lower right pane.)

If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.

For flavors that support free-spacing, this reads particularly well.

(?mx)
      ### s1: Match line that ends with a period ###
^.*\.$  
|     ### OR s2: Match anything between parentheses ###
\([^\)]*\)  
|     ### OR s3: Match any if(...//endif block ###
if\(.*?//endif  
|     ### OR capture digits to Group 1 ###
(\b\d+\b)

This is exceptionally easy to read and maintain.
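For what it's worth, here is a minimal sketch of how the free-spacing version could be written in Python, assuming the standard re module with re.VERBOSE (Python uses a single # for comments in this mode). The helper name wanted_numbers and the sample text are made up for illustration.

```python
import re

# Free-spacing version of the pattern; whitespace and '#' comments are ignored.
PATTERN = re.compile(r"""
      # s1: match a line that ends with a period
    ^.*\.$
  |   # OR s2: match anything between parentheses
    \([^\)]*\)
  |   # OR s3: match any if(...//endif block
    if\(.*?//endif
  |   # OR capture digits to Group 1
    (\b\d+\b)
""", re.MULTILINE | re.VERBOSE)


def wanted_numbers(text):
    """Return only the Group 1 captures; s1, s2 and s3 matches are discarded."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


sample = (
    "12345 is fine\n"
    "(67890) is skipped\n"
    "if( 11111 ) //endif 22222\n"
    "33333 ends with a period."
)
print(wanted_numbers(sample))
```

Because . does not match a newline here, this sketch only handles if(...//endif blocks that sit on a single line.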

Extending the regex

When you want to ignore more situations s4 and s5, you add them in more alternations on the left:

s4|s5|s1|s2|s3|(\b\d+\b)
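If the list of contexts is likely to grow, one option is to assemble the alternation from a list. The sketch below is only an illustration: build_pattern, captures and the square-bracket context used as s4 are hypothetical, and it assumes the context expressions contain no capturing groups of their own (otherwise the wanted part would no longer be Group 1).

```python
import re


def build_pattern(contexts, wanted):
    """Join the unwanted contexts with '|' and append the wanted part as Group 1."""
    # Non-capturing wrappers keep each context self-contained.
    alternatives = ["(?:%s)" % c for c in contexts]
    alternatives.append("(%s)" % wanted)
    return re.compile("|".join(alternatives), re.MULTILINE)


def captures(pattern, text):
    """Collect Group 1 from every match where it took part."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]


contexts = [
    r"^.*\.$",          # s1: line ending with a period
    r"\([^\)]*\)",      # s2: anything between parentheses
    r"if\(.*?//endif",  # s3: if(...//endif block
]
pattern = build_pattern(contexts, r"\b\d+\b")

# Hypothetical s4, added on the left: ignore anything inside square brackets.
pattern_s4 = build_pattern([r"\[[^\]]*\]"] + contexts, r"\b\d+\b")

print(captures(pattern, "10 [20] (30) 40"))
print(captures(pattern_s4, "10 [20] (30) 40"))
```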

How does this work?

The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".

The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.

I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.

Debuggex Demo

Perl/PCRE Variation

In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:

(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant

In your case:

(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b

This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant

Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)

demo for this version

Applications

Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.

  1. How can I match foo except anywhere in a tag like <a stuff...>...</a>?
  2. How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
  3. How can I match all words that are not on this black list?
  4. How can I ignore anything inside a SUB... END SUB block?
  5. How can I match everything except... s1 s2 s3?

How to Program the Group 1 Captures

You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.

If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
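As a rough idea of what that inspection might look like in Python (a sketch using the standard re module; the function name group1_matches and the sample text are my own, not taken from the article):

```python
import re

PATTERN = re.compile(
    r"^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)",
    re.MULTILINE,
)


def group1_matches(text):
    """Yield (offset, digits) for each match in which Group 1 took part."""
    for m in PATTERN.finditer(text):
        # m.group(0) may be one of the s1/s2/s3 contexts; we never look at it.
        if m.group(1) is not None:
            yield m.start(1), m.group(1)


text = "code 12345 (code 67890)\nif( 11111 ) //endif 22222"
for offset, digits in group1_matches(text):
    print(offset, digits)
```

The only extra work compared with ordinary match iteration is the single test on m.group(1).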

Alternatives

Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.

1. Replace then Match.

A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of @@@. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive @@@ strings.
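A small sketch of the match-only case in Python, assuming the standard re module; the names CONTEXTS, DIGITS and match_after_neutralizing are invented for this example.

```python
import re

# Step 1 pattern: the contexts to neutralize (s1, s2, s3).
CONTEXTS = re.compile(r"^.*\.$|\([^\)]*\)|if\(.*?//endif", re.MULTILINE)
# Step 2 pattern: what we actually want.
DIGITS = re.compile(r"\b\d+\b")


def match_after_neutralizing(text):
    """Remove the unwanted contexts, then match digits in what is left."""
    cleaned = CONTEXTS.sub("", text)
    return DIGITS.findall(cleaned)


print(match_after_neutralizing("12345 (67890) 22222\n33333 ends with a period."))
```

One thing to watch: replacing with an empty string can glue neighbors together (12(x)34 would become 1234), so a placeholder such as a space may be safer in some inputs.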

2. Lookarounds.

Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavor (found in C#, VB.NET and Visual C++, for example) and the still-experimental regex module intended to replace re in Python are the only two engines I know of that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.

Recycling the regex you had for s3 in C#, the whole pattern would look like this.

(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b

But by now you know I'm not recommending this, right?

Deletions

@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to

(s1|s2|s3)|WhatYouWant

For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else — and therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
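In Python the same idea might look like the sketch below. Python spells the group reference \1 rather than $1, and since Python 3.5 a group that did not take part in the match expands to an empty string in the replacement. The name delete_wanted is hypothetical.

```python
import re

# The contexts are now captured in Group 1; the digits are matched but not captured.
DELETE = re.compile(
    r"(^.*\.$|\([^\)]*\)|if\(.*?//endif)|\b\d+\b",
    re.MULTILINE,
)


def delete_wanted(text):
    """Delete the digits everywhere except inside the s1/s2/s3 contexts."""
    # \1 puts a matched context back; for a digits match Group 1 is unset,
    # so the replacement is empty (Python 3.5+).
    return DELETE.sub(r"\1", text)


print(delete_wanted("a 12345 b (67890) c"))
```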

Replacements

This brings us to replacements, on which I'll touch briefly.

  1. When replacing with nothing, see the "Deletions" trick above.
  2. When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
  3. In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
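For the callback approach in the last item, a Python sketch could look like this. It assumes re.sub with a function as the replacement argument; replace_wanted and the <num> placeholder are just illustrative names.

```python
import re

PATTERN = re.compile(
    r"^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)",
    re.MULTILINE,
)


def replace_wanted(text, replacement):
    """Replace Group 1 matches; leave the s1/s2/s3 contexts untouched."""
    def swap(m):
        # Group 1 set: this is a wanted match, so substitute it.
        # Otherwise the overall match is an unwanted context: put it back as is.
        return replacement if m.group(1) is not None else m.group(0)

    return PATTERN.sub(swap, text)


print(replace_wanted("12345 (67890) if( 11111 ) //endif 22222", "<num>"))
```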

Have fun!

No, wait, there's more!

Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.
