Regex: Separated by nested parentheses and semicolon


Question

My strings look like the following (each row is one example string):


Smith, Anna (Univ Cambridge); Doe, Jane (Univ Vienna (Austria)); Doe, John (Univ Tokyo; MIT)

Mueller, Hans (FU Berlin (Germany)); Schmid, Julia (); Doe, John (CalTech); Boe, Jane (TU Wien)

Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)

In other words, the pattern is Surname, Name (Affiliation); (without the trailing ; if no other person follows). The parentheses may contain one nested pair ( () ), may contain a ;, or may be empty ().

I want to extract each name together with its affiliation, as in:


Smith, Anna (Univ Cambridge)
Doe, Jane (Univ Vienna (Austria))
Doe, John (Univ Tokyo; MIT)
Mueller, Hans (FU Berlin (Germany))
Schmid, Julia ()
Doe, John (CalTech)
Boe, Jane (TU Wien)
Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)

What would be the correct regex to do this?

My attempt with (?<=\()(?:[^()]+|\([^)]+\))+ did not work well...

Recommended answer

Since your expected matches have at most one level of nested parentheses, you can use

\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?

See the regex demo.
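To try the pattern outside an online tester, here is a minimal Python sketch. The choice of Python and the helper name `extract_entries` are assumptions for illustration and not part of the original answer; stripping the trailing `;` is also an added step to match the desired output.

```python
import re

# Pattern from the answer: supports at most one level of nested parentheses.
PATTERN = re.compile(r"\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?")


def extract_entries(text):
    """Return each 'Surname, Name (Affiliation)' entry found in text."""
    # The pattern consumes an optional trailing ';', so strip it afterwards.
    return [m.group().rstrip(";") for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    row = ("Smith, Anna (Univ Cambridge); "
           "Doe, Jane (Univ Vienna (Austria)); "
           "Doe, John (Univ Tokyo; MIT)")
    for entry in extract_entries(row):
        print(entry)
```

Applied to the first example row, this yields the three entries listed in the question, with the inner `;` in `(Univ Tokyo; MIT)` left intact.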

If your regex library supports recursion or balanced constructs, the pattern can be extended to match parenthesized phrases of any nesting depth.
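Python's built-in `re` module does not support recursive patterns, so for arbitrary nesting depth one alternative is to drop the regex and scan the string while counting parentheses. The sketch below is a swapped-in technique, not the answer's method, and the function name `split_top_level` is hypothetical. It splits on every `;` that sits outside all parentheses.

```python
def split_top_level(text):
    """Split text on ';' characters that are not inside parentheses."""
    entries = []
    current = []
    depth = 0  # current parenthesis nesting depth
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Assumption: tolerate a stray ')' instead of going negative.
            depth = max(depth - 1, 0)
        if ch == ";" and depth == 0:
            # A top-level separator ends the current entry.
            entries.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    entries.append("".join(current).strip())
    # Drop empty pieces, e.g. from a trailing ';'.
    return [entry for entry in entries if entry]


if __name__ == "__main__":
    row = "Doe, Jane (Univ Vienna (Austria (EU))); Doe, John (Univ Tokyo; MIT)"
    for entry in split_top_level(row):
        print(entry)
```

Unlike the regex, this does not check that each entry has the Surname, Name (Affiliation) shape; it only finds the separators.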

Details:

  • \w+ - one or more word chars
  • , - a comma
  • \s* - zero or more whitespaces
  • \w+\s* - one or more word chars and then zero or more whitespace chars
  • \( - a ( char
  • [^()]* - zero or more chars other than ( and )
  • (?:\([^()]*\)[^()]*)* - zero or more repetitions of a (...) substring with no ( or ) inside it, followed by zero or more chars other than ( and )
  • \);? - a ) and then an optional ;.
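The breakdown above maps directly onto a commented pattern. The following Python sketch writes the same regex in `re.VERBOSE` mode and adds named groups so that surname, name, and affiliation can be read separately. The group names and the helper `parse_entries` are assumptions added for illustration; the original pattern has no capture groups.

```python
import re

ENTRY = re.compile(
    r"""
    (?P<surname>\w+),\s*          # surname, a comma, optional whitespace
    (?P<name>\w+)\s*              # given name, optional whitespace
    \(                            # opening parenthesis
    (?P<affiliation>
        [^()]*                    # chars other than parentheses
        (?:\([^()]*\)[^()]*)*     # nested groups, one level deep
    )
    \);?                          # closing parenthesis, optional semicolon
    """,
    re.VERBOSE,
)


def parse_entries(text):
    """Return (surname, name, affiliation) tuples for each entry in text."""
    return [
        (m.group("surname"), m.group("name"), m.group("affiliation"))
        for m in ENTRY.finditer(text)
    ]


if __name__ == "__main__":
    for person in parse_entries("Mueller, Hans (FU Berlin (Germany)); Schmid, Julia ()"):
        print(person)
```

An empty affiliation such as `()` comes back as an empty string rather than being skipped.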
