Regex to match only uppercase "words" with some exceptions

Question

I have technical strings like the following:

"The thing P1 must connect to the J236 thing in the Foo position."

I would like to use a regular expression to match the words that are entirely uppercase (here, P1 and J236). The problem is that I don't want to match the first letter of the sentence when it is a one-letter word.

For example, in:

"A thing P1 must connect ..." 

I want P1 only, not A and P1. I know that this means I can miss a real "word" (as in "X must connect to Y"), but I can live with that.

Additionally, I don't want to match uppercase words if the whole sentence is uppercase.

Example:

"THING P1 MUST CONNECT TO X2."

Of course, ideally, I would like to match the technical words P1 and X2 here, but since they are "hidden" in the all-uppercase sentence and since these technical words have no specific pattern, that is impossible. Again, I can live with it, because all-uppercase sentences are not very frequent in my files.

Thanks!

Recommended answer

To some extent, this is going to vary by the "flavour" of RegEx you're using. The following is based on .NET RegEx, which uses \b for word boundaries. The last example also uses negative lookarounds, (?<!...) and (?!...), as well as non-capturing parentheses, (?:...).

Basically, though, if the terms always consist of at least one uppercase letter followed by at least one digit, you can use

\b[A-Z]+[0-9]+\b

For uppercase letters and digits in any combination (two or more characters in total):

\b[A-Z0-9]{2,}\b

For uppercase letters and digits, but starting with a letter (again two or more characters in total):

\b[A-Z][A-Z0-9]+\b
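To see how the three simpler patterns differ, here is a small sketch in Python rather than .NET; these patterns use only syntax that Python's re module shares with .NET. The dictionary keys, the helper name find_terms and the second test string are made up for illustration.

```python
import re

# The three simpler patterns from the answer (the dictionary keys are illustrative labels).
PATTERNS = {
    "letters_then_digits": r"\b[A-Z]+[0-9]+\b",
    "two_or_more": r"\b[A-Z0-9]{2,}\b",
    "letter_first": r"\b[A-Z][A-Z0-9]+\b",
}

def find_terms(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

sentence = "The thing P1 must connect to the J236 thing in the Foo position."
mixed = "ID 42 X2B"  # made-up string that separates the three patterns

for name, pattern in PATTERNS.items():
    print(name, find_terms(pattern, sentence), find_terms(pattern, mixed))
```

On the question's sentence all three return P1 and J236. On the made-up string they diverge: the first finds nothing, the second finds ID, 42 and X2B, and the third finds ID and X2B.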

The granddaddy: to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:

(?:(?<!^)[A-Z]\b|(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b(?![A-Z0-9 ]$))

Breakdown:

The regex starts with (?:. The ?: signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).

Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol |. This is alternation, like an "or": the regex can match the first expression or the second. The two cases here are "one-letter words," which need special handling because we have to exclude them at the beginning of the line, and "everything else."

Now, let's look at each expression in the alternation.

The first expression is: (?<!^)[A-Z]\b. The main clause here is [A-Z]\b, which is any one capital letter followed by a word boundary; the boundary could be punctuation, whitespace, a linebreak, etc. The part before that is (?<!^), which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match (not really important to understand that here). The syntax for negative lookbehind in .NET is (?<!x), where x is the expression that must not exist before our main clause. Here that expression is simply ^, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."
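As a rough check of this branch on its own, here is a sketch in Python, whose re module accepts (?<!^) because the lookbehind has zero width. The names AS_WRITTEN, WITH_LEADING_BOUNDARY and single_letters are made up for illustration, and the second pattern is a hypothetical variant that does not appear in the answer.

```python
import re

# First branch of the alternation, as given in the answer.
AS_WRITTEN = re.compile(r"(?<!^)[A-Z]\b")
# Hypothetical variant (not in the answer): also require a word boundary before the letter.
WITH_LEADING_BOUNDARY = re.compile(r"(?<!^)\b[A-Z]\b")

def single_letters(pattern, line):
    """Return the matches of the given compiled pattern in one line."""
    return pattern.findall(line)

print(single_letters(AS_WRITTEN, "X must connect to Y"))
print(single_letters(AS_WRITTEN, "THING P1"))
print(single_letters(WITH_LEADING_BOUNDARY, "THING P1"))
```

The first call returns only Y, since X sits at the start of the line. The second call returns G: with no \b in front of [A-Z], this branch taken in isolation also accepts the last letter of a longer uppercase word. The variant with a leading \b returns nothing for that input.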

Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting entirely of numbers and uppercase letters.

That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. Each \b represents a word boundary, and [A-Z0-9]+ matches a run of one or more numbers and capital letters.

The rest of the expression consists of other lookarounds. (?<!^[A-Z0-9 ]*) is another negative lookbehind, where the expression is ^[A-Z0-9 ]*. This means that what precedes, from the start of the line, must not consist entirely of capital letters, numbers and spaces.

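The second lookaround is (?![A-Z0-9 ]$), which is a negative lookahead. The intent is that what follows must not be all capital letters and numbers. Note that, as written with no quantifier after the character class, it only rules out the case where exactly one such character sits between the match and the end of the line.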

So, altogether, we are matching words made up of capital letters and numbers, while excluding one-letter, uppercase words at the start of the line and everything from lines that are all uppercase.

There is at least one weakness here: the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.
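Python's re module only accepts fixed-width lookbehinds, so (?<!^[A-Z0-9 ]*) cannot be used there directly. The sketch below imitates just that lookbehind with an explicit prefix check to reproduce the weakness described above; it ignores the lookahead and the single-letter branch. The names TOKEN, UPPER_PREFIX and tokens_passing_lookbehind are made up for illustration.

```python
import re

TOKEN = re.compile(r"\b[A-Z0-9]+\b")       # core of the second branch
UPPER_PREFIX = re.compile(r"[A-Z0-9 ]*")   # body of the lookbehind, minus the ^

def tokens_passing_lookbehind(line):
    """Keep tokens whose preceding text is NOT made up solely of
    uppercase letters, digits and spaces (imitating the negative lookbehind)."""
    kept = []
    for m in TOKEN.finditer(line):
        prefix = line[:m.start()]
        # fullmatch on the prefix plays the role of ^[A-Z0-9 ]* ending at the token.
        if UPPER_PREFIX.fullmatch(prefix) is None:
            kept.append(m.group())
    return kept

print(tokens_passing_lookbehind("A P1 should connect to the J9"))
print(tokens_passing_lookbehind("THING P1 MUST CONNECT TO X2."))
```

The first call keeps only J9, because the text before P1 is "A ", which is all uppercase and spaces. The second call keeps nothing, which is the intended behaviour for an all-uppercase line.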

It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work, either into multiple regexes or into a combination of regex and standard string processing commands in your programming language of choice.
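One possible way to split the work, sketched in Python under the assumption that the input is processed one line at a time: a simple token regex plus ordinary string checks for the two exceptions. The function name technical_terms and the exact rules (for instance, treating a line as all-uppercase when it contains letters but no lowercase ones) are illustrative choices, not part of the answer.

```python
import re

TOKEN = re.compile(r"\b[A-Z0-9]+\b")

def technical_terms(line):
    """Return uppercase/digit words in a line, applying the two exceptions
    from the question with plain string logic instead of lookarounds."""
    # Exception 1: ignore lines whose letters are all uppercase.
    letters = [c for c in line if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return []
    terms = []
    for m in TOKEN.finditer(line):
        word = m.group()
        # Exception 2: skip a one-letter word that opens the line.
        opens_line = line[:m.start()].strip() == ""
        if len(word) == 1 and word.isalpha() and opens_line:
            continue
        terms.append(word)
    return terms

print(technical_terms("A P1 should connect to the J9"))
print(technical_terms("THING P1 MUST CONNECT TO X2."))
```

With this split, the sentence that trips up the single regex yields both P1 and J9, while the all-uppercase line still yields nothing.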
