What literal characters should be escaped in a regex?

Question

I just wrote a regex for use with the PHP function preg_match that contains the following part:

[\w-.]

It is meant to match any word character, as well as a minus sign and the dot. While it seems to work in preg_match, I tried to put it into a utility called Reggy, and it complains about "Empty range in char class". Trial and error taught me that this issue is solved by escaping the minus sign, turning the regex into

[\w\-.]

Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and, since the dot is also a character with a meaning in PHP, why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect, or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?

Solution

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, whereas [a^] matches either a or ^, which is equivalent to [\^a].
  • ] needs no escaping if it's the only character in the class: []] matches the char ].
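
The rules above can be checked quickly in any dialect you have at hand. The sketch below uses Python's re module, which is only one such dialect and is neither PHP's preg_match nor Reggy, so its results need not carry over to either. The helper names class_accepts and compiles are made up for this illustration. Python happens to be one of the stricter implementations: it rejects an unescaped hyphen directly after a shorthand class, which is similar in spirit to the complaint Reggy raised.

```python
import re


def class_accepts(char_class: str, ch: str) -> bool:
    """Return True if the single character `ch` is matched by `char_class`."""
    return re.fullmatch(char_class, ch) is not None


def compiles(pattern: str) -> bool:
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A hyphen at the very start or end of the class is taken literally.
print(class_accepts(r"[abc-]", "-"), class_accepts(r"[-abc]", "-"))

# A hyphen directly after a range is also literal in Python's dialect.
print(class_accepts(r"[a-c-abc]", "-"))

# A hyphen directly after a shorthand class: Python's `re` refuses this.
print(compiles(r"[\w-.]"))

# Escaping the hyphen is the portable spelling. The dot needs no escape
# here because it is an ordinary character inside a character class.
print(compiles(r"[\w\-.]"), class_accepts(r"[\w\-.]", "."), class_accepts(r"[\w\-.]", "!"))

# A caret that is not at the start of the class is literal.
print(class_accepts(r"[a^]", "^"), class_accepts(r"[^a]", "a"))

# A closing bracket as the only character in the class is literal.
print(class_accepts(r"[]]", "]"))
```

Since dialects disagree on the corner cases, escaping the hyphen (as in [\w\-.]) is the spelling most likely to behave the same everywhere.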
