Match all uppercase words only if there's a lowercase in the string, with one regex


Question

I stumbled upon this seemingly trivial question, and I'm stuck on it. I have a string in which I want to match, with a single regex, all uppercase words, but only if the string contains at least one lowercase letter somewhere.

Basically, I want each of these lines to produce the result shown next to it (assume the regex is applied to each line separately, so no multiline handling is needed):

ab ABC          //matches or captures ABC
ab ABC 12 CD    //matches or captures ABC, CD
ABC DE          //matches or captures nothing (no lowercase)
ABC 23 DE EFG a //matches or captures ABC, DE, EFG
AB aF DE        //matches or captures AB, DE

I am using PCRE as the regex flavor (I know some other flavors allow variable-length look-behind).

Update after comments

Obviously, there are plenty of easy solutions if I use multiple regexes or the programming language that calls the regex (e.g. first validate the string by looking for a lowercase letter, then match all uppercase words, using two different regexes).
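
For reference, the two-step approach described above could look like the following minimal sketch. Python is used purely for illustration (the question itself targets PCRE), and the function name `uppercase_words_if_lowercase` and both pattern constants are hypothetical names introduced here, not part of the original post. It assumes "letters" means ASCII `a-z` and `A-Z`, as in the examples.

```python
import re
from typing import List

# Step 1 (assumed): the line qualifies if it has at least one ASCII lowercase letter.
HAS_LOWERCASE = re.compile(r"[a-z]")
# Step 2 (assumed): an uppercase word is a run of A-Z delimited by word boundaries.
UPPERCASE_WORD = re.compile(r"\b[A-Z]+\b")


def uppercase_words_if_lowercase(line: str) -> List[str]:
    """Return the all-uppercase words of `line`, or [] if it has no lowercase letter."""
    if not HAS_LOWERCASE.search(line):
        return []
    return UPPERCASE_WORD.findall(line)


if __name__ == "__main__":
    samples = ["ab ABC", "ab ABC 12 CD", "ABC DE", "ABC 23 DE EFG a", "AB aF DE"]
    for sample in samples:
        print(repr(sample), "->", uppercase_words_if_lowercase(sample))
```

This is the baseline that the single-regex solution has to reproduce: the lowercase check is a condition on the whole line, while the word matching is repeated, which is what makes it awkward to express as a single pattern.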

My goal here is to find a way to do it with one regex.

I have no technical imperative for this constraint. Take it as an exercise in style if you have to, or curiosity, or me trying to up my regex skills: the task seemed (at first) so simple that I'd like to know whether one regex alone can achieve it. If it can't, I'd like to understand why.

Or, if it can but regexes aren't designed for this kind of task, I'd like to know why, or at least what "these kinds of unsuited tasks" are, so that I can choose the right solution when I meet them.

So, is it doable in one regex?

Recommended answer

Update
So \G is initially in a matched state at position 0.
This means that in multi-line mode, the beginning of the string (BOS) has to be treated as a special case.
Even though BOS is also a beginning of line (BOL), if the assertion (?= ^ .* [a-z] ) fails there,
\G still counts as matched at position 0 (by default?), and uppercase (UC) words are found without being validated.

(?|(?=\A.*[a-z]).*?\b([A-Z]+)\b|(?!\A)(?:(?=^.*[a-z])|\G.*?\b([A-Z]+)\b))

Update 2, posted for posterity.
After some discussion with @Robin, the above regex can be refactored to this:

 #  (?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b

 (?:
      (?= ^ .* [a-z] )        # BOL, check if line has lower case letter
   |                        # or
      (?! \A )                # Not at BOS (beginning of string, where \G is in a matched state)
      \G                      # Start the match at the end of last match (if previous matched state)
 )
 .*? \b 
 ( [A-Z]+ )              # (1), Found UC word
 \b     

Perl test case:

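use strict;
use warnings;

$/ = undef;          # slurp mode: read everything after __DATA__ as one string
my $str = <DATA>;

# List context: /g returns every capture-group-1 match in the whole string.
my @ary = $str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;
print "@ary", "\n-------------\n";

# Scalar context: each iteration resumes where the previous match ended (\G).
while ($str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg)
{
   print "$1 ";
}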

__DATA__
DA EFR
ab ABC
ab ABC 12 CD
ABC DE  t
ABC 23 DE EFG a

Output >>

ABC ABC CD ABC DE ABC DE EFG
-------------
ABC ABC CD ABC DE ABC DE EFG
