Regex: Separated by nested parentheses and semicolon
Question
My strings look like the following (each row is one exemplary string):
Smith, Anna (Univ Cambridge); Doe, Jane (Univ Vienna (Austria)); Doe, John (Univ Tokyo; MIT)
Mueller, Hans (FU Berlin (Germany)); Schmid, Julia (); Doe, John (CalTech); Boe, Jane (TU Wien)
Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)
In other words, the pattern comprises Surname, Name (Affiliation); (or without the ; if no other person follows), whereby the parentheses may optionally be nested ( () ), contain a ;, or be empty ().
I want to extract each name and affiliation, as in:
Smith, Anna (Univ Cambridge)
Doe, Jane (Univ Vienna (Austria))
Doe, John (Univ Tokyo; MIT)
Mueller, Hans (FU Berlin (Germany))
Schmid, Julia ()
Doe, John (CalTech)
Boe, Jane (TU Wien)
Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)
What would be the correct RegEx to do this?
My attempt with (?<=\()(?:[^()]+|\([^)]+\))+ did not work well...
Recommended answer
Since your expected matches can only have one level of nested parentheses, you can use
\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?
See the regex demo.
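As a minimal sketch of how the pattern above might be applied, the following Python snippet runs it over the three sample rows from the question. The helper name extract_people is hypothetical, and stripping the trailing ; is an assumption made here so that the output matches the list the question asks for (the pattern itself consumes the optional semicolon).

```python
import re

# Pattern from the answer: allows one level of nested parentheses.
PATTERN = re.compile(r"\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?")

def extract_people(line):
    """Return each 'Surname, Name (Affiliation)' entry found in line."""
    # The pattern also matches the optional trailing ';', so strip it off.
    return [m.group().rstrip(";") for m in PATTERN.finditer(line)]

rows = [
    "Smith, Anna (Univ Cambridge); Doe, Jane (Univ Vienna (Austria)); Doe, John (Univ Tokyo; MIT)",
    "Mueller, Hans (FU Berlin (Germany)); Schmid, Julia (); Doe, John (CalTech); Boe, Jane (TU Wien)",
    "Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)",
]

for row in rows:
    for person in extract_people(row):
        print(person)
```

Note that \w+ only matches a single run of word characters, so names containing spaces or hyphens would need a wider character class.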
Depending on whether or not your regex library supports recursion or balanced constructs, this can be further enhanced to match parenthetical phrases of any depth.
Details:
- \w+ - one or more word chars
- , - a comma
- \s* - zero or more whitespaces
- \w+\s* - one or more word chars and then zero or more whitespace chars
- \( - a ( char
- [^()]* - zero or more chars other than ( and )
- (?:\([^()]*\)[^()]*)* - zero or more sequences of (...) substrings with no ( and ) in between, each followed by zero or more chars other than ( and )
- \);? - a ) and then an optional ;
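Regarding nesting deeper than one level: Python's built-in re module does not support recursive patterns, so one possible alternative there is to skip the regex and split on semicolons only when the parenthesis depth is zero. This is a different technique from the regex recursion mentioned in the answer, shown only as a rough sketch; the function name split_top_level and the deeply nested sample string are made up for illustration.

```python
def split_top_level(line, sep=";"):
    """Split line on sep, ignoring separators inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        if ch == sep and depth == 0:
            # Separator outside all parentheses: close the current entry.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts

# Hypothetical input with three levels of nesting.
sample = "Kim, Lee (Lab (Dept (Univ; Campus))); Doe, John (Univ Tokyo; MIT)"
print(split_top_level(sample))
```

Unlike the regex, this does not check that each entry has the Surname, Name (Affiliation) shape; it only finds the entry boundaries.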