Split and replace Unicode words in JavaScript with regex

Problem description

I need to wrap each word from a list of Unicode words in {} wherever it occurs in a Unicode string. Here is my code:

var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|tw|two two|two|twöu|three|föur)(?=\\W|$)", "gi");
alert(txt.replace(re, '$1{$2}'));

It returns:

¿{One};{one} {one}é {two two} {two two} {two} {tw}ö {tw}öu {three};;{tw}ä;{föur}?

but should be:

¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?

What am I doing wrong?

Solution

The Problem

What am I doing wrong?

Unfortunately, the answer is that you are doing nothing wrong. Javascript is.

The problem is that JavaScript does not support Unicode regular expressions as they are spelled out in The Unicode Standard.
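To make the failure concrete, here is a minimal sketch in plain JavaScript (no library, no `u` flag). It shows that the built-in `\W` treats accented letters as non-word characters, which is why the lookahead in the question's pattern succeeds in the middle of "oneé" and "twö". The helper name `isNonWord` and the shortened input string are only illustrative.

```javascript
// Plain JavaScript regex: \w is [A-Za-z0-9_] only, so \W matches everything else.
const isNonWord = (ch) => /\W/.test(ch);

console.log(isNonWord("é")); // true: treated like punctuation
console.log(isNonWord("ö")); // true
console.log(isNonWord("e")); // false

// The lookahead (?=\W|$) therefore succeeds in front of "é" and "ö".
const re = /(^|\W)(one|tw)(?=\W|$)/gi;
const out = "oneé twö".replace(re, "$1{$2}");
console.log(out); // {one}é {tw}ö
```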

There is, however, a rather nice library called XRegExp which has a JavaScript plugin that helps a great deal. I recommend it, albeit with several notable caveats. You need to know what it can do, and what it cannot.


What It Does

  • Corrects various bugs and inconsistencies in JavaScript implementations, including those in its split function.
  • Supports the BMP code points covered by the 6.1 release of the Unicode Character Database, from January 2012.
  • Correctly ignores case, space, hyphen-minuses, and underscores in Unicode property names, per The Standard — something which even Java gets wrong.
  • Supports the Unicode General Categories like \p{L} for letters and \p{Sc} for currency symbols.
  • Supports the standard full property names like \p{Letter} for \p{L} and \p{Currency_Symbol} for \p{Sc}.
  • Supports the Unicode Script properties, like \p{Latin}, \p{Greek}, and \p{Common}.
  • Supports the Unicode Block properties, like \p{InBasic_Latin} and \p{InMathematical_Alphanumeric_Symbols}.
  • Supports the other 9 Unicode properties needed for level-1 compliance: \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{Any}, \p{ASCII}, and \p{Assigned}.
  • Supports named captures instead of just numbered ones, using standard notation to do so: (?<NAME>⋯) to declare a named group, \k<NAME> to backref it by name, and use ${NAME} in the replacement pattern (and in general access it using result.NAME in your code). This is the same syntax used by Perl 5.10, Java 7, .ɴᴇᴛ, and several other languages. It makes writing complex regexes a lot easier by letting you name parts instead of just numbering them, so that when you move stuff around you don’t have to recalculate the numbered variables.
  • Supports /s ᴀᴋᴀ (?s) mode so that dot matches any single code point, rather than anything except for a linebreak sequence. Most other regex engines support this mode.
  • Supports /x ᴀᴋᴀ (?x) mode so that whitespace and comments are ignored (if unescaped). Most regex engines support this mode. It is absolutely indispensable for creating legible — and hence, maintainable — patterns.
  • Supports embedded comments even when not in /x mode using the standard (?#⋯) notation to do so (such as seen in Perl). This lets you put comments in individual regex pieces without going all the way to /x mode, which is often important in developing more complex patterns, by allowing you to build them up piece-wise.
  • Supports extensibility, so that you can add new token types if you want, such as \a to mean the ALERT character, or the POSIXish character classes.

What It Doesn’t

You should be careful, however, about the things that it does not do:

  • Does not support full Unicode, but only code points from Plane 0. This is a forbidden restriction, as The Unicode Standard requires that there be no difference between astral and non-astral code points in a regular expression. Even Java did not get this right until JDK7. (However, the v2.1.0 development version does support full Unicode.)
  • Does not support \X for grapheme clusters, or \R for linebreak sequences.
  • Does not support two-part properties, like \p{GC=Letter}, \p{Block=Phonetic_Extensions}, \p{Script=Greek}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.
  • It does not update the character class shortcuts to operate per the requirements of UTS#18. Standard JavaScript only allows \s to match the Unicode \p{White_Space} property; it does not allow \d to match \p{Nd} (although some old browsers will do that anyway!) nor \w to match [\p{Alphabetic}\pM\p{Nd}\p{Pc}], let alone providing Unicode-aware versions of \b and \B, all of which are part of the requirements for supporting Unicode Regular Expressions.
  • It does not support some commonly used properties. In practice, the one that is missing is \p{digit}, and perhaps also the rather useful \p{Dash}, \p{Math}, \p{Diacritic}, and \p{Quotation_Mark} properties.
  • Has no support for grapheme clusters such as using \X or even via (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). This is a really big deal.
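The behavior of the built-in shorthand classes described in this list can be checked directly. The following is a small sketch using only standard JavaScript regex literals without any flags. The object name `checks` and the sample characters are illustrative choices, not part of the original answer.

```javascript
// Built-in shorthand classes, no flags and no library.
const checks = {
  // \s does cover Unicode white space such as NO-BREAK SPACE (U+00A0).
  nbspIsSpace: /\s/.test("\u00A0"),
  // \d does not match ARABIC-INDIC DIGIT THREE (U+0663), although it is \p{Nd}.
  arabicDigitIsDigit: /\d/.test("\u0663"),
  // \w does not match LATIN SMALL LETTER E WITH ACUTE (U+00E9).
  accentedIsWordChar: /\w/.test("\u00E9"),
  // \b is built on \w, so it reports a boundary between "f" and "é".
  boundaryInsideWord: /\bcaf\b/.test("caf\u00E9"),
};

console.log(checks);
```

The last entry is the same effect that breaks the pattern in the question: a word boundary is reported inside a word.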

Workarounds

Here are a few workarounds to handle a few of the places where the library doesn’t follow The Unicode Standard:

  • For the missing \w, you can use [\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers, as they’re not \p{Nd}-type numbers which are the only ones that count as alphanumeric.
  • For the missing \W, you can therefore use the set-complement of the previous one, so [^\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers.
  • Since \b is really the same as (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), you could plug that \w definition into that sequence to create a Unicode-aware version of \b — provided that JavaScript supported all four directions of lookaround, which when last I checked, it did not. You have to have both positive and negative lookbehind, not just lookahead, to do this correctly. Javascript neglects to support those, at least as far as I can see.
  • Since \B is really the same as (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), you could do the same, but subject to the same conditions.
  • For the missing \X, you can get sorta close by using \P{M}\p{M}*, but that incorrectly splits up CRLF constructs and allows marks on the same, all of which is really quite wrong.
  • For the missing \R, you can construct a work-around using (?:\r\n|[\n-\r\u0085\u2028\u2029]).
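As a rough sketch of how the \w and \W substitutes above apply to the question's data, the following assumes an ES2018 or later engine, where the native `u` flag enables \p{...} escapes. That feature postdates this answer and is an assumption here, not something the answer relies on. The \p{InEnclosedAlphanumerics} term is left out because block properties are an XRegExp feature rather than native syntax. The names `WORD`, `wrapWords`, and `LINEBREAK` are hypothetical.

```javascript
// Assumption: an ES2018+ engine, where the `u` flag enables \p{...} escapes.
// This is the answer's \w substitute without the block property.
const WORD = "\\p{L}\\p{Nl}\\p{Nd}\\p{M}";

// Hypothetical helper: wrap each listed word in {} when it stands alone.
// Assumes the words contain no regex metacharacters.
function wrapWords(text, words) {
  const re = new RegExp(
    `(^|[^${WORD}])(${words.join("|")})(?=[^${WORD}]|$)`,
    "giu"
  );
  return text.replace(re, "$1{$2}");
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
const words = ["one", "tw", "two two", "two", "twöu", "three", "föur"];
const wrapped = wrapWords(txt, words);
console.log(wrapped);
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?

// The \R substitute from the list needs no special engine support.
const LINEBREAK = /(?:\r\n|[\n-\r\u0085\u2028\u2029])/;
const lines = "a\r\nb\u2028c\nd".split(LINEBREAK);
console.log(lines); // [ 'a', 'b', 'c', 'd' ]
```

The negated class stands in for \W on both sides of the word, so "oneé", "twö", and "twä" are no longer split at the accented letter.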

Summary

The conclusion is that JavaScript’s regexes are completely unsuited for Unicode work. However, the XRegExp plugin moves closer to making that feasible. If you can live with its restrictions, this is probably easier than switching to a different but Unicode-aware programming language. It’s certainly better than being unable to use Unicode regexes at all.

However, it is still a long way from meeting even the most basic requirements (Level 1 support) for Unicode regexes as spelled out in the standard. Someday you are going to want to match characters whether or not they carry accent marks, or characters set in the Mathematical Alphanumeric Symbols block, or to use the Unicode case-mapping and case-folding definitions, or to follow The Unicode Standard for alphanumeric sorts or for line- and word-breaking, and you cannot do any of those things in JavaScript even with the plug-in.

So you might wish to consider using a language that is compliant with The Unicode Standard if you actually need to handle Unicode. Javascript just doesn’t manage that.
