Human Name parsing

Problem description

I have a bunch of human names. They are all "Western" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the people to whom I am sending things did not input their own names so I can't ask them what they would like to be called. I know the gender of each person and their full name, but haven't really parsed things out more specifically.

Some examples:

  1. John Smith
  2. John Smith, Jr.
  3. John Smith Jr.
  4. John Smith XIV
  5. Dr. John Smith Ph.D.

I'd like to be able to parse out parts of each name:

name = Name.new("John Smith Jr.")
name.first_name # => "John"
name.greeting   # => "Mr. Smith"

If I'm looking for "greeting" (probably not the best term), what I want here is, for 1-4, "Mr. Smith". For 5, I would like Dr. Smith but I'd settle for Mr. Smith.

A Ruby gem for this would be ideal. I was inspired to ask for something this strange by Chronic, a Ruby gem that handles time in a remarkably human way, letting me tell it "last Tuesday" and having it come up with something sensible. An algorithm that hits most of the corner cases would suffice.

I'm trying to deal with some of the issues presented in "Falsehoods Programmers Believe About Names".

Recommended answer

Since you're limited to Western-style names, I think a few rules will get you most of the way there:

  1. If a comma appears, delete the leftmost one and everything after.
  2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
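  3. Continue removing words from the end while, after the same normalization as in step 2, they belong to the set { jr phd } or are Roman numerals of value roughly 50 or less (/^[XVI]+$/ is probably a good enough regex; the anchors keep it from matching ordinary words that merely contain one of those letters).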
  4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
  5. As the surname, use the last word.
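
The five rules above can be sketched in Ruby roughly as follows. This is a minimal illustration, not an existing gem: the method name `greeting`, the constants, the `:male`/`:female` gender symbols and the printed title labels are all assumptions made for the example. The Roman-numeral regex is anchored with \A and \z so that it matches whole words only, and each loop stops before removing the last remaining word, which is an extra guard the rules do not mention.

```ruby
# Hypothetical sketch of the five rules; none of these names come from a real gem.
TITLE_SCORES = { "mr" => 1, "mrs" => 1, "miss" => 1, "ms" => 1,
                 "rev" => 2, "dr" => 3, "prof" => 4 }.freeze
TITLE_LABELS = { "mr" => "Mr.", "mrs" => "Mrs.", "miss" => "Miss", "ms" => "Ms.",
                 "rev" => "Rev.", "dr" => "Dr.", "prof" => "Prof." }.freeze
SUFFIXES = %w[jr phd].freeze
ROMAN    = /\A[XVI]+\z/ # whole-word match, upper case only

# Lowercase a word and strip full stops, e.g. "Ph.D." becomes "phd".
def normalize(word)
  word.downcase.delete(".")
end

def greeting(full_name, gender)
  # Step 1: drop the leftmost comma and everything after it.
  words = full_name.split(",", 2).first.to_s.split

  # Step 2: strip leading titles, remembering the highest-scoring one.
  best = nil
  while words.size > 1 && TITLE_SCORES.key?(normalize(words.first))
    title = normalize(words.shift)
    best = title if best.nil? || TITLE_SCORES[title] > TITLE_SCORES[best]
  end

  # Step 3: strip trailing suffixes and small Roman numerals.
  while words.size > 1 &&
        (SUFFIXES.include?(normalize(words.last)) || words.last.match?(ROMAN))
    words.pop
  end
  return nil if words.empty?

  # Step 4: prefer a title that was actually supplied, else fall back on gender.
  label = best ? TITLE_LABELS[best] : (gender == :female ? "Mrs." : "Mr.")

  # Step 5: the last remaining word is taken as the surname.
  "#{label} #{words.last}"
end

puts greeting("John Smith, Jr.", :male)      # Mr. Smith
puts greeting("John Smith XIV", :male)       # Mr. Smith
puts greeting("Dr. John Smith Ph.D.", :male) # Dr. Smith
```

Keeping the scores and the printed labels in separate tables makes it easy to reorder title priorities without touching the parsing loops.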

It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname? Or is "Baxter" a middle name? I think it's safe to assume that middle names are relatively more common than double-barrelled-but-unhyphenated surnames, meaning it's better to default to reporting the last word as the surname. You might want to also compile a list of common double-barrelled surnames and check against this, however.
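
One way to check against such a list is sketched below. The method name `surname` and the constant `DOUBLE_BARRELLED` are hypothetical, and the two entries are only placeholders for a list you would have to compile yourself. The method expects the words left over once titles and suffixes have been stripped, and it falls back to the last word when the final two words are not on the list.

```ruby
# Hypothetical, hand-maintained list of unhyphenated double-barrelled surnames.
# The entries are placeholders for illustration only.
DOUBLE_BARRELLED = ["lloyd webber", "bonham carter"].freeze

# words: the name split into words, with titles and suffixes already removed.
def surname(words)
  return nil if words.empty?

  last_two = words.last(2).join(" ")
  # Require at least three words so that a first name is always left over.
  if words.size >= 3 && DOUBLE_BARRELLED.include?(last_two.downcase)
    last_two
  else
    words.last # default: treat the last word as the surname
  end
end

puts surname(%w[Andrew Lloyd Webber]) # Lloyd Webber
puts surname(%w[John Baxter Smith])   # Smith
```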
