How to split a string by commas positioned outside of parentheses?


Problem description

I got a string of such format:

"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

So basically it's a list of actors' names (each optionally followed by their role in parentheses). The role itself can contain commas (the actor's name cannot, or so I strongly hope).

My goal is to split this string into a list of pairs - (actor name, actor role).

One obvious solution would be to go through each character, check for occurrences of '(', ')' and ',', and split whenever a comma occurs outside the parentheses. But this seems a bit heavy...
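For reference, the character-by-character scan described above can be sketched in a few lines. This is only an illustration: the function name split_outside_parens is made up here, and the sketch assumes the parentheses in the input are balanced.

```python
def split_outside_parens(text):
    """Split text on commas that are not inside parentheses.

    Hypothetical helper for illustration; assumes balanced parentheses.
    """
    fields = []
    start = 0   # index where the current field begins
    depth = 0   # how many parentheses are currently open
    for i, ch in enumerate(text):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ',' and depth == 0:
            # A comma outside any parentheses ends the current field.
            fields.append(text[start:i].strip())
            start = i + 1
    # Whatever follows the last top-level comma is the final field.
    fields.append(text[start:].strip())
    return fields


x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
fields = split_outside_parens(x)
# ['Wilbur Smith (Billy, son of John)', 'Eddie Murphy (John)', 'Elvis Presley', 'Jane Doe (Jane Doe)']
```

Because it counts open parentheses rather than just toggling a flag, this version would also cope with nested parentheses.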

I was thinking about splitting it using a regexp, first splitting the string on the parentheses:

import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x) 
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']

The odd elements here (counting from 1) are the actor names, and the even ones are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse than my first approach.

Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?

Solution

One way to do it is to use findall with a regex that greedily matches the things that can go between separators. For example:

>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more of the following:

  • non-comma, non-open-paren characters
  • strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren
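The question asks for (actor name, actor role) pairs, so the matched fields still need one more step. A minimal sketch of that step, assuming no nested parentheses and at most one parenthesised role per actor; the name parse_cast and the choice of None for a missing role are illustrative assumptions, not part of the original answer:

```python
import re

# Same pattern as above: runs of non-comma, non-"(" characters,
# or a complete "(...)" group.
FIELD = re.compile(r'(?:[^,(]|\([^)]*\))+')


def parse_cast(text):
    """Return a list of (name, role) tuples; role is None when absent.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for field in FIELD.findall(text):
        if not field.strip():
            continue  # skip whitespace-only pieces such as in "a, , b"
        # Everything before the first "(" is the name.
        name, paren, rest = field.partition('(')
        # Everything up to the last ")" after it is the role.
        role = rest.rpartition(')')[0].strip() if paren else None
        pairs.append((name.strip(), role))
    return pairs


s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
pairs = parse_cast(s)
# [('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
#  ('Elvis Presley', None), ('Jane Doe', 'Jane Doe')]
```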

One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
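A short comparison with str.split makes the difference visible (the input string here is made up for the demonstration):

```python
import re

r = re.compile(r'(?:[^,(]|\([^)]*\))+')

with_gap = "Elvis Presley,,Jane Doe"

# findall only returns non-empty matches, so the empty field disappears.
found = r.findall(with_gap)      # ['Elvis Presley', 'Jane Doe']

# A plain split keeps the empty string between the two commas.
plain = with_gap.split(',')      # ['Elvis Presley', '', 'Jane Doe']
```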

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:

"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it; this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (keeping track of the nesting level is what regexes are incapable of doing on their own).
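One possible shape for that tokenize-and-reassemble approach is sketched below. The names TOKEN and split_top_level are invented for this example, and clamping the depth at zero on a stray closing paren is an assumption about how malformed input should be handled.

```python
import re

# A token is a single paren or comma, or a run of anything else.
TOKEN = re.compile(r'[(),]|[^(),]+')


def split_top_level(text):
    """Split on commas at nesting level 0 (hypothetical helper)."""
    fields = []
    current = []  # tokens collected for the field being built
    depth = 0     # current nesting level
    for token in TOKEN.findall(text):
        if token == '(':
            depth += 1
        elif token == ')':
            depth = max(depth - 1, 0)  # assumption: ignore stray ")"
        if token == ',' and depth == 0:
            # Top-level comma: close the current field, drop the comma.
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append(''.join(current).strip())
    return fields


nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

fields = split_top_level(nested)
# ['Wilbur Smith (son of John (Johnny, son of James), aka Billy)', 'Eddie Murphy (John)']

# For comparison, the findall regex cuts the first entry in two.
naive = re.compile(r'(?:[^,(]|\([^)]*\))+').findall(nested)
```

Here the regex only does the tokenizing, while the depth counter does the part a regex cannot do on its own.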
