Regex to match all words but AND, OR and NOT
Question
In my JavaScript app I have this random string:
büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)
and I would like to match all words, special chars and numbers besides the words AND, OR and NOT.
What I tried is this:
/(?!AND|OR|NOT)\b[\u00C0-\u017F\w\d]+/gi
which results in ["büert", "3454jhadf", "asdfsdf", "technüology", "bar", "bas"], but this one does not match the ü or any other letter outside the a-z alphabet at the beginning or at the end of a word because of the \b word boundary.
Removing the \b oddly ends up matching parts of the words I would like to exclude:
/(?!AND|OR|NOT)[\u00C0-\u017F\w\d]+/gi
results in ["büert", "ND", "OT", "3454jhadf", "üasdfsdf", "R", "technüology", "ND", "bar", "R", "bas"]
What is the correct way to match all words, no matter what type of characters they contain, besides the ones I want to exclude?
Recommended answer
The issue here has its roots in the fact that \b (and \w, and other shorthand classes) are not Unicode-aware in JavaScript.
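A small check can make this concrete. The sketch below is only an illustration (the variable names are made up here): it shows that \w rejects ü and that no \b exists between a space and ü, which is why the first attempt in the question started matching at the "a".

```javascript
const umlaut = "\u00FC"; // the letter "ü"

// \w only covers [A-Za-z0-9_], so "ü" is not a word character.
const isWordChar = /\w/.test(umlaut);

// A space and "ü" are both non-word characters, so there is no \b between them.
const boundaryBeforeUmlaut = new RegExp("\\b" + umlaut).test(" " + umlaut + "a");

// The first boundary sits between "ü" and "a", so the match starts at "a".
const firstMatch = (" " + umlaut + "asdfsdf").match(/\b\w+/)[0];

console.log(isWordChar, boundaryBeforeUmlaut, firstMatch); // false false "asdfsdf"
```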
Now, there are 2 ways to achieve what you want. The first is to split the string on the unwanted words and punctuation:
var re = /\s*\b(?:AND|OR|NOT)\b\s*|[()]/;
var s = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";
var res = s.split(re).filter(Boolean);
document.body.innerHTML += JSON.stringify(res, 0, 4);
// => [ "büert", "3454jhadf üasdfsdf", "technüology", "bar", "bas" ]
Note the use of a non-capturing group (?:...) so as not to include the unwanted words in the resulting array (split() adds the text of capturing groups to its result). Also, you need to add all punctuation and other unwanted characters to the [()] character class.
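As a rough sketch of that last point, the split pattern can be built from a list of words and a string of punctuation. The helper name splitTerms, its parameters and the sample input are assumptions made for this illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: split on the stop words and on any listed punctuation.
function splitTerms(input, stopWords, punctuation) {
  const words = stopWords.join("|");
  // Escape characters that are special inside a character class.
  const cls = punctuation.replace(/[\\\]^-]/g, "\\$&");
  const re = new RegExp("\\s*\\b(?:" + words + ")\\b\\s*|[" + cls + "]");
  // Trim leftover whitespace and drop empty pieces.
  return input.split(re).map((s) => s.trim()).filter(Boolean);
}

const terms = splitTerms(
  "büert AND NOT foo, bar; (baz OR qux)",
  ["AND", "OR", "NOT"],
  "(),;"
);
console.log(terms); // [ "büert", "foo", "bar", "baz", "qux" ]
```

Because the stop words are ASCII, \b still works around them here; only the words being kept contain non-ASCII letters.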
Alternatively, you can match the wanted words by using groupings with anchors and a negated character class in a regex like this:
(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)
Capture group 2 will hold the values you need.
See the regex demo.
JS code demo:
var re = /(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)/gi;
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m;
var arr = [];
while ((m = re.exec(str)) !== null) {
    arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);
Or with a block to build the regex dynamically:
var bndry = "[^\\u00C0-\\u017F\\w]";
var re = RegExp("(^|" + bndry + ")" + // starting boundary
"(?!(?:AND|OR|NOT)(?=" + bndry + "|$))" + // restriction
"([\\u00C0-\\u017F\\w]+)" + // match and capture our string
"(?=" + bndry + "|$)" // set trailing boundary
, "g");
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m, arr = [];
while ((m = re.exec(str)) !== null) {
    arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);
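On engines that support String.prototype.matchAll (ES2020), the exec loop can probably be shortened. The following is a minimal sketch under that assumption; it reuses the same pattern, and the variable names are chosen for this example only.

```javascript
// Same pattern as above; matchAll requires the "g" flag.
const wordChar = "[\\u00C0-\\u017F\\w]";
const boundary = "[^\\u00C0-\\u017F\\w]";
const pattern = new RegExp(
  "(^|" + boundary + ")" +                          // leading boundary
  "(?!(?:AND|OR|NOT)(?=" + boundary + "|$))" +      // skip the excluded words
  "(" + wordChar + "+)" +                           // capture group 2: the word
  "(?=" + boundary + "|$)",                         // trailing boundary
  "g"
);
const input = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";

// Collect capture group 2 from every match.
const words = Array.from(input.matchAll(pattern), (m) => m[2]);
console.log(words);
// [ "büert", "3454jhadf", "üasdfsdf", "technüology", "bar", "bas" ]
```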
Explanation:
(^|[^\u00C0-\u017F\w]) - our custom leading boundary: matches the start of the string (^) or any character outside the [\u00C0-\u017F\w] class
(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$)) - a restriction on the match: the match fails if AND, OR or NOT appears here followed by the end of the string or by a character outside the [\u00C0-\u017F\w] class (that is, as a whole word)
([\u00C0-\u017F\w]+) - matches and captures one or more word characters ([a-zA-Z0-9_]) or characters from the \u00C0-\u017F range
(?=[^\u00C0-\u017F\w]|$) - the trailing boundary: either the end of the string ($) or a character outside the [\u00C0-\u017F\w] class
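The \u00C0-\u017F range only covers Latin letters with diacritics. As a different technique from the one in the answer, and assuming an engine with the u flag, \p{...} property escapes and lookbehind (all ES2018), the same boundary idea can be sketched with Unicode categories instead. The sample string with a Cyrillic word is made up for this illustration.

```javascript
// "Word character" here means any letter, combining mark, digit or underscore.
const w = "[\\p{L}\\p{M}\\p{N}_]";
const unicodeWords = new RegExp(
  "(?<!" + w + ")" +                     // leading boundary: no word character before
  "(?!(?:AND|OR|NOT)(?!" + w + "))" +    // skip AND, OR, NOT as whole words
  w + "+",                               // the word itself
  "gu"
);

const sample = "büert AND NOT 3454jhadf üasdfsdf OR технология AND (bar OR bas)";
const found = sample.match(unicodeWords);
console.log(found);
// [ "büert", "3454jhadf", "üasdfsdf", "технология", "bar", "bas" ]
```

Since the lookbehind does not consume the preceding character, the whole match is the word and no capture group is needed.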