Javascript + Unicode regexes
Question
How can I use Unicode-aware regular expressions in JavaScript? For example, there should be something akin to \w that can match any code point in the Letter or Mark categories (not just the ASCII ones), and hopefully filters like [[P*]] for punctuation etc.
Situation for ES 6
The upcoming ECMAScript language specification, edition 6, includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6.
Until ES 6 is finished and widely adopted among browser vendors you're still on your own, though. Update: There is now a transpiler named regexpu that translates ES6 Unicode regular expressions into equivalent ES5. It can be used as part of your build process. Try it out online.
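As a rough illustration of what the u modifier changes, here is a minimal sketch that assumes an engine with ES6 regex support. The variable names are made up for the example. The last pattern uses \p{…} property escapes, which are not part of ES6; they were added later (ES2018) and also require the u modifier, so treat that part as an assumption about your runtime.

```javascript
// An astral symbol (outside the Basic Multilingual Plane) is stored as
// two UTF-16 code units in a JavaScript string.
const astral = "\u{1F4A9}";

// Without u, "." matches a single code unit, so the two-unit string fails.
const dotWithoutU = /^.$/.test(astral);

// With u, "." matches a whole code point.
const dotWithU = /^.$/u.test(astral);

// Code point escapes of the form \u{...} inside a pattern need the u flag.
const escapeWithU = /^\u{1F4A9}$/u.test(astral);

// Assumption: the engine supports ES2018 property escapes.
// \p{L} is any letter, \p{M} is any combining mark.
const letterOrMark = /^[\p{L}\p{M}]+$/u;
const matchesGreek = letterOrMark.test("λόγος");
const matchesHyphenated = letterOrMark.test("a-b");

console.log(dotWithoutU, dotWithU, escapeWithU); // false true true
console.log(matchesGreek, matchesHyphenated);    // true false
```

On engines without these features, the patterns carrying the u flag throw a SyntaxError at parse time, which is the case a transpiler such as regexpu is meant to cover.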
Situation for ES 5 and below
Even though JavaScript operates on Unicode strings, it does not implement Unicode-aware character classes and has no concept of POSIX character classes or Unicode blocks/sub-ranges.
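To make the limitation concrete, the sketch below shows that \w only covers ASCII word characters, followed by a hand-built class as a workaround. The Latin-1 ranges are an illustrative assumption and cover only a small slice of Unicode letters.

```javascript
// \w is equivalent to [A-Za-z0-9_], so accented letters are not "word" characters.
const asciiWord = /^\w+$/;
const plainMatches = asciiWord.test("cafe");
const accentedMatches = asciiWord.test("caf\u00E9"); // "café"

// A hand-built class: ASCII letters plus the Latin-1 Supplement letters.
// U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped on purpose.
const latin1Letters = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;
const accentedWithClass = latin1Letters.test("caf\u00E9");

console.log(plainMatches, accentedMatches, accentedWithClass); // true false true
```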
Check your expectations here: Javascript RegExp Unicode Character Class tester (Edit: the original page is down, the Internet Archive still has a copy.)
Flagrant Badassery has an article on JavaScript, Regex, and Unicode that sheds some light on the matter.
Also read Regex and Unicode here on SO. You will probably have to build your own "punctuation character class".
Check out the Regular Expression: Match Unicode Block Range builder, which lets you build a JavaScript regular expression that matches characters that fall in any number of specified Unicode blocks.
I just did it for the "General Punctuation" and "Supplemental Punctuation" sub-ranges, and the result is as simple and straightforward as I expected:
[\u2000-\u206F\u2E00-\u2E7F]
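A short usage sketch for the class above. The helper name stripBlockPunctuation and the sample string are hypothetical. Note that these two blocks do not include ASCII punctuation such as "!", and that General Punctuation also contains various space and formatting characters, so this class is a starting point rather than a complete punctuation matcher.

```javascript
// General Punctuation (U+2000 to U+206F) plus
// Supplemental Punctuation (U+2E00 to U+2E7F), with the global flag for replace().
const blockPunctuation = /[\u2000-\u206F\u2E00-\u2E7F]/g;

// Hypothetical helper: replace every character from those blocks with a space.
function stripBlockPunctuation(text) {
  return text.replace(blockPunctuation, " ");
}

// U+2014 (em dash) and U+2026 (horizontal ellipsis) are both in General Punctuation.
const sample = "one\u2014two\u2026 three!";
const cleaned = stripBlockPunctuation(sample);

console.log(JSON.stringify(cleaned)); // "one two  three!"
```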
There is also XRegExp, a project that brings Unicode support to JavaScript by offering an alternative regex engine with extended capabilities.
And of course, required reading: mathiasbynens.be - JavaScript has a Unicode problem.