Splitting List of Names where there Might Be Common Last Name for Two First Names


Problem Description

In Python, I'm parsing a large list of names that looks something like this:

[u' Ron Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton ',
 u' Barbara Loughlin, Dr. Gerald Loughlin, and Debbie Gelston ',
 u' Julianne Michelle 
    ... ']

I'm able to split these into individual names using this:

re.split('(([A-Z]\.?\s?)*([A-Z][a-z]+\.?\s?)+([A-Z]\.?\s?[a-z]*)*)', line)[1::5]

For example, if I call this on the first element of the sample data above, it returns:

[u'Ron Iervolino', u'Trish Iervolino', u'Russ Middleton', u'Lisa Middleton ']

Cool. This works for a lot of cases. The issue I'm having is that in some instances the names are in this form:

[   ...,
 u' Kelly  and Tom Murro ',
    ...]

This refers to both Kelly Murro and Tom Murro. Any ideas on how to match this particular case? I have a function that does the regex operation (it calls re.split), so my thought was to extend that function and check for this case first. When a list has more than two names, the last name appears to be written out for each first name. The shared form only seems to occur when there are exactly two names in the list and they share a last name.

Edit

I like the simplicity of the "alpha bravo" solution. In trying to understand what's happening, I messed around with the Regex101 site demo and had it generate some code. The code doesn't appear to do anything, and maybe my brain is melting from staring at this for so long. Any suggestions?

import re
p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly  and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"

result = re.sub(p, subst, test_str)

The variable result is just the replacement string.
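
A likely reason the snippet above has no useful effect: the $1$2 $3 replacement uses the group-reference syntax of other regex flavors, whereas Python's re.sub expects \1 or \g<1>, so the text is inserted literally. The following is only a sketch, assuming Python 3 and a hypothetical helper named extract_names (not part of the original post). It keeps the same pattern but reads the groups with findall instead of substituting:

```python
import re

# Same pattern as in the question, written as a Python 3 raw string.
# Group 1: a full "First Last" name.
# Group 2: a lone first name followed by "and First Last".
# Group 3: the last name captured inside the lookahead.
pattern = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'
)

def extract_names(text):
    """Hypothetical helper: return the full names found in text."""
    names = []
    # findall yields one (group1, group2, group3) tuple per match;
    # groups that did not take part in the match come back as ''.
    for full, first, shared_last in pattern.findall(text):
        if full:
            # Normalise inner whitespace of an already complete name.
            names.append(' '.join(full.split()))
        else:
            # A lone first name borrows the last name seen in the lookahead.
            names.append(first + ' ' + shared_last)
    return names

print(extract_names("Ron Iervolino, Kelly  and Tom Murro"))
# ['Ron Iervolino', 'Kelly Murro', 'Tom Murro']
```

Like the original pattern, this sketch does not handle titles such as "Dr." or names with more than two words.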

Recommended Answer

For your first case, a more efficient approach is str.split(), provided the names in your string are separated by commas:

>>> s=u' Ron Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton '
>>> [i.split('and')[1] if i.strip().startswith('and') else i for i in s.split(',')]
[u' Ron Iervolino', u' Trish Iervolino', u' Russ Middleton', u' Lisa Middleton ']

To find the names in u' Kelly and Tom Murro ' you can use the following:

import re

l=[]
s=u' Ron Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton ,Kelly  and Tom Murro'
for i in s.split(','):
   i=i.strip()
   if i.startswith('and') :
      # ', and Lisa Middleton': drop the leading 'and'
      l.append(i.split('and')[1])
   elif not i.endswith('and') and 'and' in i :
      # 'Kelly  and Tom Murro': the last token is the shared last name
      names=[i for i in re.split(r'and| ',i) if i]
      # pair every first name with the shared last name
      for t in zip(names[:-1],[names[-1] for i in range(len(names)-1)]):
          l.append(' '.join(t))
   else:
      l.append(i)

print l
[u'Ron Iervolino', u'Trish Iervolino', u'Russ Middleton', u' Lisa Middleton', u'Kelly Murro', u'Tom Murro']

When you encounter a string like u' Kelly and Tom Murro ', first split it into a list of name tokens with [i for i in re.split(r'and| ',i) if i], which splits the string on 'and' and on spaces, so you will have [u'Kelly', u'Tom', u'Murro']. Then, since you want the following names:

u'Kelly Murro'
u'Tom Murro'

Note that this recipe also works for longer lists such as (Kelly and Tom and rose and sarah Murro). You can use zip() to pair each first name in names[:-1] with the last element, repeated once per first name, which gives:

[(u'Kelly', u'Murro'), (u'Tom', u'Murro')]
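
One caveat with splitting on the bare substring 'and' is that it also cuts names that merely contain those letters, such as Sandra. The following is a minimal sketch of the same idea, assuming Python 3. It uses a word-boundary pattern so that only the standalone word is treated as a separator. The helper name split_names and the constant AND_WORD are hypothetical and not part of the original answer:

```python
import re

# Match "and" only as a standalone word, so "Sandra" or "Rand" stay intact.
AND_WORD = re.compile(r'\band\b')

def split_names(line):
    """Hypothetical helper: split a comma/'and'-separated line into full names."""
    result = []
    for chunk in line.split(','):
        # Split the chunk on the word "and" and drop empty pieces.
        parts = [p.strip() for p in AND_WORD.split(chunk) if p.strip()]
        if not parts:
            continue
        # Assumption: the last part of the chunk carries the shared last name.
        last_name = parts[-1].split()[-1]
        for idx, part in enumerate(parts):
            # A single-token part before the last one (e.g. "Kelly")
            # borrows the shared last name.
            if idx < len(parts) - 1 and len(part.split()) == 1:
                part = part + ' ' + last_name
            # Collapse repeated whitespace inside the name.
            result.append(' '.join(part.split()))
    return result

print(split_names(' Kelly  and Tom Murro '))
# ['Kelly Murro', 'Tom Murro']
print(split_names('Sandra Anderson and Andy Rand'))
# ['Sandra Anderson', 'Andy Rand']
```

The zip() pairing from the answer is replaced here by a plain loop, which is just one possible design choice. The behavior for "first and first last" chunks is intended to be the same.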
