javascript regex of a javascript string


Problem description

I need to match a JavaScript string with a regular expression: a string enclosed in single quotes that may contain a single quote only when it is escaped with a backslash.

The example strings I want to match look like the following:

'abcdefg'
'abc\'defg'
'abc\'de\'fg'

Solution

This is the regex that matches every valid JavaScript string literal surrounded by single quotes (') and rejects all invalid ones. Note that strict mode is assumed.

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'/
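As a minimal sketch of how this regex might be used for validation: the answer's pattern has no anchors, so the `^` and `$` below are an addition that forces the whole input to be exactly one literal. The names `SINGLE_QUOTED` and `isSingleQuotedLiteral` are hypothetical, chosen only for this example.

```javascript
// The longer regex from the answer, wrapped in ^...$ (the anchors are an assumption
// added for this sketch so that the entire input must be a single literal).
const SINGLE_QUOTED =
  /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'$/;

// Hypothetical helper: true when `source` is the source text of one single-quoted literal.
function isSingleQuotedLiteral(source) {
  return SINGLE_QUOTED.test(source);
}

// Each "\\" below is one backslash in the text being tested.
const valid = [
  "'abcdefg'",
  "'abc\\'defg'",     // 'abc\'defg'
  "'abc\\'de\\'fg'",  // 'abc\'de\'fg'
];
const invalid = [
  "'abc'defg'", // unescaped quote in the middle
  "'abc\\'",    // the final quote is escaped, so the literal never closes
  "'\\x4'",     // a hex escape needs exactly two hex digits
  "'\\01'",     // \0 followed by a digit is rejected (strict mode assumed)
];

for (const s of valid) console.log("valid?  ", s, isSingleQuotedLiteral(s));
for (const s of invalid) console.log("invalid?", s, isSingleQuotedLiteral(s));
```

Without the anchors, `test` would also succeed on any longer text that merely contains a literal somewhere, which may or may not be what you want.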

Or a shorter version:

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'/

The regex above is based on the definition of StringLiteral (ignoring the double quoted version) specified in ECMAScript Language Specification, 5.1 Edition published in June 2011.
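The two versions are meant to accept the same inputs. A quick way to gain some confidence in that is to run both over a set of probe strings and list any disagreement. This is only a spot check on hand-picked samples, not a proof; the anchors and the names `LONG`, `SHORT`, and `samples` are assumptions made for this sketch.

```javascript
// Both regexes from the answer, anchored so the whole input must be one literal.
const LONG =
  /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'$/;
const SHORT =
  /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$/;

// Hand-picked probes; "\\" is one backslash in the tested text.
const samples = [
  "'abc'", "'a\\'b'", "'\\q'", "'\\0'", "'\\01'", "'\\8'",
  "'\\x41'", "'\\x4'", "'\\u0041'", "'\\u41'",
  "'a\\\nb'",     // backslash + LF (line continuation)
  "'a\\\r\nb'",   // backslash + CR LF
  "'a\\\rb'",     // backslash + CR
  "'a\\\u2028b'", // backslash + LS
  "'a\nb'",       // raw line feed, not allowed
  "'a'b'",        // unescaped quote
];

// Collect every sample on which the two regexes give different answers.
const disagreements = samples.filter((s) => LONG.test(s) !== SHORT.test(s));
console.log("disagreements:", disagreements.length);
```

The shorter version works because `[^\n\rxu0-9]` after the backslash already covers the single escape characters, the non-escape characters, and the line and paragraph separators, leaving only LF and CR to be listed explicitly.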

The regex for a JavaScript string literal surrounded by double quotes (") is almost the same:

/"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"/
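To illustrate that only the quote character changes between the two patterns, here is a small sketch that tries both shorter regexes in turn. The anchors and the helper name `classify` are assumptions introduced for this example.

```javascript
// Shorter single-quote and double-quote regexes from the answer, anchored for whole-input checks.
const SINGLE_QUOTED =
  /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$/;
const DOUBLE_QUOTED =
  /^"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"$/;

// Hypothetical helper: reports which kind of literal `source` is, or null if neither.
function classify(source) {
  if (SINGLE_QUOTED.test(source)) return "single";
  if (DOUBLE_QUOTED.test(source)) return "double";
  return null;
}

console.log(classify("\"it's\""));       // text: "it's"
console.log(classify("'say \"hi\"'"));   // text: 'say "hi"'
console.log(classify("\"a\\\"b\""));     // text: "a\"b"
console.log(classify("\"a\"b\""));       // text: "a"b"
console.log(classify("'a\""));           // text: 'a"
```

The other quote character needs no escaping inside a literal, which is why only the first character class differs between the two regexes.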


Let's dissect the monster (the longer version, since it is a direct translation from the grammar):

  • A StringLiteral (ignoring the double quote version) starts and ends with ', as can be seen in the regex. In between the quotes is an optional sequence of SingleStringCharacter. This explains the *, meaning 0 or more characters.

  • SingleStringCharacter is defined as:

    SingleStringCharacter ::
           SourceCharacter but not one of ' or \ or LineTerminator
           \ EscapeSequence
           LineContinuation
    

    [^'\\\n\r\u2028\u2029] corresponds to the first rule

    \\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}) corresponds to the second rule

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the third rule

  • Let's look at the first rule: SourceCharacter but not one of ' or \ or LineTerminator. This first rule deals with "normal" characters.

    SourceCharacter is any Unicode code unit.

    LineTerminator is Line Feed <LF> (\u000A or \n), Carriage Return <CR> (\u000D or \r), Line Separator <LS> (\u2028) or Paragraph Separator <PS> (\u2029).

    So we will just use a negative character class to represent this rule: [^'\\\n\r\u2028\u2029].

  • For the second rule, which deals with escape sequences, you can see \ before EscapeSequence, as it appears in the regex. As for EscapeSequence, this is its grammar (strict mode):

    EscapeSequence ::
            CharacterEscapeSequence
            0 [lookahead ∉ DecimalDigit]
            HexEscapeSequence
            UnicodeEscapeSequence
    

    ['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9] is the regex for CharacterEscapeSequence. It can actually be simplified to [^\n\r\u2028\u2029xu0-9]

    The first part is SingleEscapeCharacter, which includes ', ", \, and, for control characters, b, f, n, r, t, v.

    The second part is NonEscapeCharacter, which is SourceCharacter but not one of EscapeCharacter or LineTerminator. EscapeCharacter is defined as SingleEscapeCharacter, DecimalDigit or x (for hex escape sequence) or u (for unicode escape sequence).

    0(?![0-9]) is the regex for the second rule of EscapeSequence. This is for specifying the null character \0.

    x[0-9a-fA-F]{2} is the regex for HexEscapeSequence

    u[0-9a-fA-F]{4} is the regex for UnicodeEscapeSequence

  • The third rule deals with strings that span multiple lines. Let's look at the grammar of LineContinuation and the related productions:

    LineContinuation ::
            \ LineTerminatorSequence
    
    LineTerminatorSequence :: 
            <LF> 
            <CR> [lookahead ∉ <LF> ]
            <LS>
            <PS>
            <CR> <LF>
    

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the above grammar.
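The three rules dissected above can be kept as separate regexes and glued back together, which makes it easier to see which rule accepts a given piece of text. This is a sketch under assumptions: the names `RULE1`, `RULE2`, `RULE3`, `LITERAL`, and `whichRule` are hypothetical, and the `^`/`$` anchors are added for whole-input checks.

```javascript
// One regex per SingleStringCharacter rule, copied from the dissection above.
const RULE1 = /[^'\\\n\r\u2028\u2029]/; // normal character
const RULE2 = /\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})/; // \ EscapeSequence
const RULE3 = /\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029])/; // LineContinuation

// Reassemble the long regex from the pieces via RegExp#source.
const LITERAL = new RegExp(`^'(?:${RULE1.source}|${RULE2.source}|${RULE3.source})*'$`);

// Anchored copies, used to test one SingleStringCharacter at a time.
const ANCHORED = [RULE1, RULE2, RULE3].map((r) => new RegExp(`^(?:${r.source})$`));

// Hypothetical helper: returns 1, 2 or 3 for the first rule that accepts `text`, else null.
function whichRule(text) {
  const index = ANCHORED.findIndex((r) => r.test(text));
  return index === -1 ? null : index + 1;
}

// "\\" is one backslash in the tested text.
console.log(whichRule("a"));        // normal character
console.log(whichRule("\\n"));      // backslash + n
console.log(whichRule("\\x41"));    // hex escape
console.log(whichRule("\\\r\n"));   // backslash + CR LF
console.log(whichRule("\\9"));      // backslash + nonzero digit
console.log(LITERAL.test("'abc\\'de\\'fg'"));
```

Building `LITERAL` from `source` strings keeps each rule readable on its own while still producing the same pattern as the single long regex.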
