Regex to match all words but AND, OR and NOT


Problem description

In my JavaScript app I have this random string:

büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)

and I would like to match all words, special characters and numbers except the words AND, OR and NOT.

What I have tried is this:

/(?!AND|OR|NOT)\b[\u00C0-\u017F\w\d]+/gi
which results in
["büert","3454jhadf","asdfsdf","technüology","bar","bas"]

However, because of the \b word boundary, this does not match the ü, or any other letter outside a-z, at the beginning or end of a word.

Removing the \b oddly ends up matching parts of the words I would like to exclude:

/(?!AND|OR|NOT)[\u00C0-\u017F\w\d]+/gi
which results in
["büert","ND","OT","3454jhadf","üasdfsdf","R","technüology","ND","bar","R","bas"]

What is the correct way to match all words, no matter what type of characters they contain, except the ones I want to exclude?

Recommended answer

The root of the issue is that \b (along with \w and the other shorthand classes) is not Unicode-aware in JavaScript: \w matches only [A-Za-z0-9_], so ü counts as a non-word character and \b finds no boundary between a space and ü.
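A quick way to see this is the minimal sketch below, which can be pasted into any JavaScript console. The variable names are only illustrative; the sample word is taken from the string in the question.

```javascript
// In JavaScript, \w is exactly [A-Za-z0-9_], so ü is not a "word" character.
const wMatchesAscii = /\w/.test("a");    // true
const wMatchesUmlaut = /\w/.test("ü");   // false

// \b is defined in terms of \w: there is no boundary between a space and ü,
// but there is one between ü and the ASCII letter that follows it.
const boundaryBeforeUmlaut = /\bü/.test(" üasdfsdf");   // false
const firstBoundedWord = " üasdfsdf".match(/\b\w+/)[0]; // "asdfsdf"

console.log(wMatchesAscii, wMatchesUmlaut, boundaryBeforeUmlaut, firstBoundedWord);
```

This is why the first attempt in the question returns "asdfsdf" instead of "üasdfsdf".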

Now, there are two ways to achieve what you want. The first is to split the string on the words and characters you do not want:

var re = /\s*\b(?:AND|OR|NOT)\b\s*|[()]/;
var s = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";
var res = s.split(re).filter(Boolean);
document.body.innerHTML += JSON.stringify(res, 0, 4);
// => [ "büert", "3454jhadf üasdfsdf", "technüology", "bar", "bas" ]

Note the use of a non-capturing group (?:...) so that the unwanted words are not included in the resulting array. Also, you need to add all punctuation and other unwanted characters to the character class ([()] in this example).
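To illustrate why the group must be non-capturing, here is a small sketch comparing the two forms on a shortened version of the input. The variable names are illustrative only.

```javascript
const input = "büert AND NOT 3454jhadf";

// Capturing group: split() splices the captured separators into the result.
const withCapture = input.split(/\s*\b(AND|OR|NOT)\b\s*/).filter(Boolean);

// Non-capturing group: only the text between the separators is returned.
const withoutCapture = input.split(/\s*\b(?:AND|OR|NOT)\b\s*/).filter(Boolean);

console.log(JSON.stringify(withCapture));    // ["büert","AND","NOT","3454jhadf"]
console.log(JSON.stringify(withoutCapture)); // ["büert","3454jhadf"]
```

The filter(Boolean) call is still needed in both cases, because two adjacent separators such as "AND NOT" leave an empty string between them.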

The second is to match the words directly, using groupings with anchors and a negated character class in place of \b, in a regex like this:

(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)

Capture group 2 will hold the values you need.

See the regex demo.

JS code demo:

var re = /(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)/gi; 
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m;
var arr = []; 
while ((m = re.exec(str)) !== null) {
  arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);

Or, with a block that builds the regex dynamically:

var bndry = "[^\\u00C0-\\u017F\\w]";
var re = RegExp("(^|" + bndry + ")" +                   // starting boundary
           "(?!(?:AND|OR|NOT)(?=" + bndry + "|$))" +    // restriction
           "([\\u00C0-\\u017F\\w]+)" +                  // match and capture our string
           "(?=" + bndry + "|$)"                        // set trailing boundary
           , "g"); 
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m, arr = []; 
while ((m = re.exec(str)) !== null) {
  arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);

Explanation:

  • (^|[^\u00C0-\u017F\w]) - our custom leading boundary: either the start of the string (^) or any character outside the [\u00C0-\u017F\w] class
  • (?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$)) - a restriction on the match: the match fails if AND, OR or NOT comes next and is followed by the end of the string or by a character that is neither a word character nor in the \u00C0-\u017F range
  • ([\u00C0-\u017F\w]+) - match and capture one or more word characters ([a-zA-Z0-9_]) or characters from the \u00C0-\u017F range
  • (?=[^\u00C0-\u017F\w]|$) - the trailing boundary: either the end of the string ($) or a character that is neither a word character nor in the \u00C0-\u017F range
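
The pieces explained above can also be wrapped in a reusable function that takes the list of excluded words as a parameter. This is only a sketch: the name matchWordsExcept and its signature are hypothetical and do not come from the original answer, and the character range is the same \u00C0-\u017F assumption used throughout.

```javascript
// Hypothetical helper (not part of the original answer).
function matchWordsExcept(text, excluded) {
  const wordChar = "[\\u00C0-\\u017F\\w]";
  const boundary = "[^\\u00C0-\\u017F\\w]";
  // Escape regex metacharacters so the excluded words are matched literally.
  const alternatives = excluded
    .map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  const re = new RegExp(
    "(^|" + boundary + ")" +                                // leading boundary
    "(?!(?:" + alternatives + ")(?=" + boundary + "|$))" +  // restriction
    "(" + wordChar + "+)" +                                 // captured word
    "(?=" + boundary + "|$)",                               // trailing boundary
    "g"
  );
  const result = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    result.push(m[2]); // group 2 holds the word itself
  }
  return result;
}

const words = matchWordsExcept(
  "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)",
  ["AND", "OR", "NOT"]
);
console.log(JSON.stringify(words));
// ["büert","3454jhadf","üasdfsdf","technüology","bar","bas"]
```

Like the dynamic version above, this uses only the g flag, so the excluded words are matched case-sensitively.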
