Name Extraction - CV/Resume - Stanford NER/OpenNLP

Question

I'm currently working on a learning project to extract an individual's name from their CV/resume.

Currently I'm working with Stanford NER and OpenNLP, which both perform with a degree of success out of the box but tend to struggle on "non-Western" names (no offence intended towards anybody).

My question is: given the general lack of sentence structure or context around an individual's name in a CV/resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus?

My initial thought is that I'd probably have more success by sentence splitting, removing obvious text and applying a bit of logic to make a best guess at the individual's name.

I can see how training would work if a name appears within a structured sentence; however, for a standalone entity without context (Akbar Agho, for example) I suspect it will struggle regardless of the training.

Is there a level of AI that, if given enough data, would begin to formulate a pattern for finding a name, or should I just apply a level of logic-based string extraction?

I'd appreciate people's thoughts, opinions and suggestions.

Side note: I have been using PHP with Apache Tika to do the initial text extraction from Doc/PDF files and am experimenting with Stanford NER and OpenNLP via PHP and the command line.

Chris

Recommended answer

My 2 cents on the problem.

Sticking with the NER taggers you listed above would be the first block in my pipeline. If they identify a name there, voilà, there is no need to go any further. If not, I suggest you fall back to a rule-based approach. In a resume, the candidate's name is generally in the top 10% of the lines. In many cases it is also introduced by a label, as in "Name : Ankit Solanki". If that fails, try to find the email address and match it against the different noun phrase (NP) pairs you get from the rest of the text in the resume; the closest match should be the name, because in the majority of cases an email address used for professional purposes such as a resume contains the person's name. For example, john.mayer89@abc.com gets cleaned to john.mayer, which in turn goes through an algorithm that finds the noun phrase closest to the cleaned email name.
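To make the rule-based fallback a bit more concrete, here is a rough Python sketch of the two rules (a "Name :" label near the top, then matching against the email address). It is only an illustration under some assumptions of mine: the function names, the regular expressions, the use of difflib's SequenceMatcher as the "closest match" measure, and treating short letters-only lines as stand-ins for noun phrases are all hypothetical choices, not something Stanford NER or OpenNLP provides.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns, chosen for illustration only.
LABEL_RE = re.compile(r"^\s*name\s*[:\-]\s*(.+?)\s*$", re.IGNORECASE)
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Stand-in for "noun phrases": lines made of 2 to 4 letters-only tokens.
CANDIDATE_RE = re.compile(r"^[^\W\d_]+(?:[ '\-][^\W\d_]+){1,3}$")


def clean_local_part(local):
    """Turn an email local part such as 'john.mayer89' into 'john mayer'."""
    return re.sub(r"[\d._+\-]+", " ", local).strip().lower()


def name_from_label(lines, top_fraction=0.1):
    """Look for a 'Name : ...' label in the top fraction of the lines."""
    limit = max(1, int(len(lines) * top_fraction))
    for line in lines[:limit]:
        match = LABEL_RE.match(line)
        if match:
            return match.group(1)
    return None


def name_from_email(lines):
    """Pick the candidate line most similar to the cleaned email name."""
    match = EMAIL_RE.search("\n".join(lines))
    if not match:
        return None
    target = clean_local_part(match.group(1))
    candidates = [ln.strip() for ln in lines if CANDIDATE_RE.match(ln.strip())]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, target, c.lower()).ratio(),
    )


def guess_name(text):
    """Run the two rules in order; meant to be called after NER found nothing."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return name_from_label(lines) or name_from_email(lines)


if __name__ == "__main__":
    sample = (
        "Curriculum Vitae\n"
        "John Mayer\n"
        "Software Engineer\n"
        "Email: john.mayer89@abc.com\n"
        "Skills: Python, Java\n"
    )
    print(guess_name(sample))
```

In a real pipeline you would probably feed in noun phrases from a chunker rather than whole lines, and you may want a minimum similarity threshold so that an unrelated email address does not force a bad guess.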

Let me know your thoughts on this.

Best,

Ankit
