Learning Regular Expressions


Question

I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?

Solution

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.

If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.

Start simple

Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.

Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
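To see literal matching in action, here is a minimal sketch that assumes Python's built-in re module; the sample sentences are made up for illustration.

```python
import re

# re.search scans the input for the first place the pattern matches.
match = re.search(r"Nick", "My name is Nick.")
found = match.group(0)   # the text that matched
start = match.start()    # index where the match begins

# Literal characters are case-sensitive by default, so 'nick' is not 'Nick'.
no_match = re.search(r"Nick", "my name is nick.")

print(found, start, no_match)
```

A failed search returns None rather than raising an error, so check the result before calling methods on it.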

If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)

Order from the menu

Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.

The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].

Think of character classes as menus: pick just one.
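The menu idea can be sketched like this, again assuming Python's re module; the candidate strings are only illustrative.

```python
import re

# fullmatch requires the whole string to match, which makes
# "exactly one character from the class" easy to see.
names = ["Nick", "nick", "Rick", "NNick"]
class_hits = [s for s in names if re.fullmatch(r"[Nn]ick", s)]

# A range inside a class: any one of 'a', 'b' or 'c'.
range_hits = re.findall(r"[a-c]", "abcdef")

# The dot accepts any single character, but it must still be exactly one.
dot_hits = [s for s in ["Nick", "Rick", "?ick", "ick"] if re.fullmatch(r".ick", s)]

print(class_hits, range_hits, dot_hits)
```

'NNick' and 'ick' are rejected because a class (or a dot) stands for one character, no more and no fewer.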

Helpful shortcuts

Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
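A short sketch of the shortcuts, assuming Python's re module and an invented sample string:

```python
import re

text = "Room 42, floor_3"

digits = re.findall(r"\d", text)      # one entry per digit
spaces = re.findall(r"\s", text)      # one entry per whitespace character
words = re.findall(r"\w+", text)      # runs of letters, digits or underscore
non_space = re.findall(r"\S+", text)  # runs of anything that is not whitespace

print(digits, spaces, words, non_space)
```

Note how the comma belongs to a \S+ run but not to a \w+ run: punctuation is non-whitespace, yet it is not a word character.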

Once is not enough

From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are

  • * (zero or more times)
  • + (one or more times)
  • {n} (exactly n times)
  • {n,} (at least n times)
  • {n,m} (at least n times but no more than m times)

Putting some of these blocks together, the pattern [Nn]*ick matches all of

  • ick
  • Nick
  • nick
  • Nnick
  • nNick
  • nnick
  • (and so on)

The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
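The quantifiers above can be tried out with a sketch like the following, assuming Python's re module. The extra candidates 'Rick', 'abbc' and the runs of 'a' are additions for contrast.

```python
import re

candidates = ["ick", "Nick", "nick", "Nnick", "nNick", "nnick", "Rick"]
star_hits = [s for s in candidates if re.fullmatch(r"[Nn]*ick", s)]

# ? makes the 'b' optional: zero or one occurrence.
optional_hits = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]

# {n,m} bounds the repeat count on both sides.
counted_hits = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{2,3}", s)]

# * always succeeds: with no 'x' in the input it matches zero times,
# producing an empty match at the very start.
empty = re.search(r"x*", "abc")

print(star_hits, optional_hits, counted_hits, repr(empty.group(0)))
```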

A few other useful examples:

  • [0-9]+ (and its equivalent \d+) matches any non-negative integer
  • \d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
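Both examples can be checked with a small sketch, assuming Python's re module and a made-up input line:

```python
import re

text = "Invoice 66 dated 2019-01-01"

# [0-9]+ grabs every run of digits, including the pieces of the date.
integers = re.findall(r"[0-9]+", text)

# The date pattern insists on the 4-2-2 digit shape with hyphens.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# The pattern checks shape only, not calendar validity.
shape_only = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2019-99-99") is not None

print(integers, dates, shape_only)
```

If you need real dates, match the shape with the regex and then validate the captured numbers in code.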

Grouping

A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.

To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
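Here is a sketch of both uses of parentheses, quantifying a unit and capturing text, assuming Python's re module. The date example reuses the pattern from earlier with groups added.

```python
import re

samples = ["0abc0", "0abcc0", "0abcabc0"]

# Without parentheses, + applies only to the 'c'.
ungrouped = [s for s in samples if re.fullmatch(r"0abc+0", s)]

# With parentheses, + applies to the whole 'abc' unit.
grouped = [s for s in samples if re.fullmatch(r"0(abc)+0", s)]

# Each parenthesized group also captures the text it matched.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2019-01-01")
year, month, day = m.groups()

print(ungrouped, grouped, year, month, day)
```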

Alternation

Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).

For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
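The scope of | is easiest to see with some surrounding text. This sketch assumes Python's re module; the 'Hi ' prefix is added purely for illustration.

```python
import re

# Without parentheses the alternatives are 'Hi Nick' and 'nick',
# so 'Hi nick' matches neither of them in full.
loose = re.fullmatch(r"Hi Nick|nick", "Hi nick")
loose_alone = re.fullmatch(r"Hi Nick|nick", "nick")

# Parentheses confine the choice to the name itself.
tight = re.fullmatch(r"Hi (Nick|nick)", "Hi nick")

print(loose, loose_alone is not None, tight is not None)
```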

Escaping

Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
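A sketch of escaping in practice, assuming Python's re module. Raw strings (the r prefix) keep Python's own backslash handling out of the way, and re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

# \\ matches a literal backslash and \+ a literal plus sign.
literal = re.search(r"\\d\+", r"the pattern \d+ matches digits")

# Unescaped, + is a quantifier, so '1+1=2' means "one or more 1s, then 1=2".
as_pattern = re.fullmatch(r"1+1=2", "1+1=2")
repeated = re.fullmatch(r"1+1=2", "111=2")

# re.escape turns arbitrary text into a pattern that matches it literally.
as_literal = re.fullmatch(re.escape("1+1=2"), "1+1=2")

print(literal.group(0), as_pattern, repeated is not None, as_literal is not None)
```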

Greediness

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

For example, say the input is

"Hello," she said, "How are you?"

You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.

To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches a literal left parenthesis, followed by one or more characters, and terminated by a right parenthesis.

If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
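The difference shows up clearly when both forms run on the same input. A minimal sketch, assuming Python's re module:

```python
import re

text = '"Hello," she said, "How are you?"'

# Greedy: .+ runs to the last double quote it can reach.
greedy = re.search(r'".+"', text).group(0)

# Non-greedy: .+? stops at the first double quote that lets the pattern finish.
lazy = re.search(r'".+?"', text).group(0)

# The pattern from the question, applied to '(123) (456)'.
captures = re.findall(r"\((.+?)\)", "(123) (456)")
greedy_capture = re.search(r"\((.+)\)", "(123) (456)").group(1)

print(greedy, lazy, captures, greedy_capture)
```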

(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)

Anchors

Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.

Say you want to match comments of the form

-- This is a comment --

you'd write ^--\s+(.+)\s+--$.
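Trying the bookend pattern, as a sketch assuming Python's re module; the second input is invented to show what the anchors reject.

```python
import re

pattern = r"^--\s+(.+)\s+--$"

m = re.search(pattern, "-- This is a comment --")
body = m.group(1)

# The anchors reject input with extra text before or after the bookends.
rejected = re.search(pattern, "x -- This is a comment -- y")

print(repr(body), rejected)
```

The greedy .+ first swallows the rest of the line, then backs off just far enough for \s+--$ to match, which leaves the capture free of the trailing space and dashes.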

Build your own

Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.

Tools for writing and debugging regexes:

Books

Free resources

Footnote

†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
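Python's counterpart to those switches is the re.DOTALL flag. A small sketch of the three behaviours, assuming Python's re module:

```python
import re

text = "first\nsecond"

# By default the dot stops at a newline.
default = re.search(r".+", text).group(0)

# With re.DOTALL the dot matches newlines too.
dotall = re.search(r".+", text, re.DOTALL).group(0)

# The portable workaround: whitespace or non-whitespace, i.e. anything.
portable = re.search(r"[\s\S]+", text).group(0)

print(repr(default), repr(dotall), repr(portable))
```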
