What's the correct regex range for JavaScript's regexes to match all the non-word characters in any script?

Problem description

In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it matches [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?

https://regex101.com/r/yhNF8U/1/
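To make the problem concrete, here is a minimal sketch of the behaviour described above. The sample string and variable names are only illustrative assumptions, not part of the original question.

```javascript
// Even with the u flag, JavaScript's \W stays the ASCII-only class [^A-Za-z0-9_].
const question = 'Привет, мир'; // Russian for "Hello, world" (illustrative input)

// Every Cyrillic letter is reported as a "non-word" character.
const asciiNonWord = question.match(/\W/gu);
console.log(asciiNonWord.length); // 11: all 9 letters plus the comma and the space

// Likewise, \w finds no word characters in the string at all.
const asciiWord = question.match(/\w/gu);
console.log(asciiWord); // null
```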

Solution

Generic solution

Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W will look like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
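As a rough sketch, the UTS18-style class can be used in JavaScript as-is, assuming an engine that supports Unicode property escapes (ES2018 or later) and the u flag. The input string and variable names below are made up for illustration.

```javascript
// Unicode-aware \W following the UTS18 recommendation quoted above.
// Property escapes such as \p{Alphabetic} only work with the u flag.
const nonWordUts18 =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Illustrative input mixing Greek, Devanagari and Latin words.
const mixedText = 'καλημέρα नमस्ते hello_world!';

// Replace every non-word character with a visible separator.
const replaced = mixedText.replace(nonWordUts18, '|');
console.log(replaced); // καλημέρα|नमस्ते|hello_world|
```

Only the two spaces and the exclamation mark are replaced. The Devanagari combining signs survive because they are covered by \p{Mark}, and the underscore because it is covered by \p{Connector_Punctuation}.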

More considerations

The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters in each regex engine.

For example, the .NET definition of the non-word character class \W is [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo} together with \p{Lm} can be contracted to a plain \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].

In Android (see documentation), \W matches [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

  • [^ - is the start of the negated character class that matches a single char other than:
    • \p{L} - any Unicode letter
    • \p{N} - any Unicode number (decimal digits, letter numbers and other numeric characters)
    • \p{M} - a combining mark, such as a diacritic
    • \p{Pc} - a connector punctuation symbol
  • ] - end of the character class.

Note that it is the \p{Pc} class that matches an underscore.
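The breakdown above can be checked with a small sketch. The test string is an assumption chosen to exercise a combining mark and an underscore, and the variable names are illustrative.

```javascript
const roughNonWord = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu;

// "cafe" followed by U+0301 COMBINING ACUTE ACCENT: the accent is its own code point.
const decomposed = 'cafe\u0301_bar, 123';

// Only the comma and the space are non-word characters here.
const roughMatches = decomposed.match(roughNonWord);
console.log(roughMatches); // [ ',', ' ' ]

// The built-in \W additionally reports the combining accent.
const builtinMatches = decomposed.match(/\W/gu);
console.log(builtinMatches.length); // 3

// The individual classes that make the difference:
const underscoreIsPc = /\p{Pc}/u.test('_');      // true
const accentIsMark = /\p{M}/u.test('\u0301');    // true
console.log(underscoreIsPc, accentIsMark);
```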

Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
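A quick way to see the difference between \p{L} and \p{Alphabetic} is to test a letter number directly. This is only a sketch; U+216B (ROMAN NUMERAL TWELVE) is used as the sample character and the variable names are illustrative.

```javascript
const romanTwelve = '\u216B'; // Ⅻ ROMAN NUMERAL TWELVE, general category Nl

const isLetter = /\p{L}/u.test(romanTwelve);               // false: not a letter
const isLetterNumber = /\p{Nl}/u.test(romanTwelve);        // true: a letter number
const isAlphabetic = /\p{Alphabetic}/u.test(romanTwelve);  // true: Alphabetic includes Nl

console.log(isLetter, isLetterNumber, isAlphabetic); // false true true
```

So a class built on \p{L} alone treats Ⅻ as a non-word character, while one built on \p{Alphabetic} (or one that also includes \p{N}) treats it as a word character.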

Other variations:

  • /[^\p{L}0-9_]/gu - a \W that is aware of Unicode letters only (digits stay ASCII)
  • /[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W that is aware of Unicode letters and numbers only.
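To compare the variations side by side, here is a small sketch run on one input. The sample string is an assumption picked so that each pattern gives a different count, and the variable names are illustrative.

```javascript
const lettersOnly = /[^\p{L}0-9_]/gu;           // Unicode letters, ASCII digits
const lettersAndNumbers = /[^\p{L}\p{N}_]/gu;   // PCRE-style
const roughDecomposition = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu;

// "x" + U+0301 COMBINING ACUTE ACCENT, a space, then U+0663 ARABIC-INDIC DIGIT THREE.
const variantSample = 'x\u0301 \u0663';

// Count how many characters each pattern treats as non-word characters.
const countLettersOnly = variantSample.match(lettersOnly).length;
const countLettersAndNumbers = variantSample.match(lettersAndNumbers).length;
const countRough = variantSample.match(roughDecomposition).length;

console.log(countLettersOnly);       // 3: the accent, the space and the Arabic-Indic digit
console.log(countLettersAndNumbers); // 2: the accent and the space
console.log(countRough);             // 1: only the space
```

Which variation fits depends on whether combining marks and non-ASCII digits should count as part of a word in your data.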

Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.
