Name Extraction - CV/Resume - Stanford NER/OpenNLP

Problem description

I'm currently working on a learning project to extract an individual's name from their CV/resume.

Currently I'm working with Stanford NER and OpenNLP, which both perform with a degree of success out of the box but tend to struggle with "non-Western" names (no offence intended towards anybody).

My question is: given the general lack of sentence structure or context around an individual's name in a CV/resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus?

My initial thought is that I'd probably have more success by sentence splitting, removing obvious text and applying a bit of logic to make a best guess at the individual's name.
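To make that idea concrete, here is a rough sketch of the kind of filtering I have in mind. Everything in it is an assumption for illustration: the function name `candidate_name_lines`, the `SECTION_WORDS` list, the 15-line window, the 2-4 token limit, and the sample address and email are all made up, and the rules would certainly need tuning on real CVs.

```python
import re

# Hypothetical list of heading words to discard; extend it for real data.
SECTION_WORDS = {"curriculum", "vitae", "resume", "cv", "profile", "summary",
                 "education", "experience", "skills", "contact", "address"}

def candidate_name_lines(text, max_lines=15):
    """Return lines near the top of the text that look like a standalone name."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    candidates = []
    for line in lines[:max_lines]:
        # Remove "obvious" non-name text: digits, email addresses, URLs.
        if re.search(r"[\d@]|https?://|www\.", line):
            continue
        tokens = line.split()
        # Assumption: a name is between two and four tokens long.
        if not 2 <= len(tokens) <= 4:
            continue
        # Skip lines that contain a known heading word.
        if any(tok.lower().strip(".,:") in SECTION_WORDS for tok in tokens):
            continue
        # Every token must start with an uppercase letter.
        if all(tok[0].isupper() for tok in tokens):
            candidates.append(line)
    return candidates

sample = """Curriculum Vitae
Akbar Agho
12 High Street, London
akbar.agho@example.com
Work Experience
"""
print(candidate_name_lines(sample))
```

This only narrows the text down to plausible lines; it does not decide between several surviving candidates, and it would miss names written entirely in lowercase.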

I can see how training would work if a name appears within a structured sentence. However, for a standalone entity without context (Akbar Agho, for example), I suspect it will struggle regardless of the training.

Is there a level of AI that, if given enough data, would begin to formulate a pattern for finding a name, or should I just apply some logic-based string extraction?

I'd appreciate people's thoughts, opinions and suggestions.

Side note: I have been using PHP with Apache Tika to do the initial text extraction from Doc/PDF files and am experimenting with Stanford NER and OpenNLP via PHP and the command line.

Chris

Solution

My two cents on the problem.

I would keep the NER taggers you listed above as the first block in the pipeline. If they identify the name, voilà, there is no need to go any further. If not, I suggest you fall back to a rule-based approach.

In a resume, the candidate's name is generally within the top 10% of the lines. In many cases it also appears after a label, as in "Name : Ankit Solanki".

If that fails, try to find the email address and match it against the different NP (noun phrase) pairs you get from the rest of the text in the resume; the closest match should be the name. In the majority of cases, an email address used for professional purposes such as a resume contains the person's name. For example, john.mayer89@abc.com gets cleaned to john.mayer, which then goes through an algorithm that finds the noun phrase closest to the cleaned email name.
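A minimal sketch of the two fallback rules, in Python for brevity. The function names, the regular expressions and the use of `difflib.SequenceMatcher` as the "closest match" measure are my own assumptions, not a fixed recipe; the list of candidate noun phrases is assumed to come from whatever chunker or NER tagger you already run. The sketch also replaces the dot with a space when cleaning (john.mayer becomes "john mayer") so that it compares more directly with a phrase from the text.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns; real resumes will need more variants.
LABEL_RE = re.compile(r"^[ \t]*name[ \t]*[:\-][ \t]*(.+)$",
                      re.IGNORECASE | re.MULTILINE)
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def name_from_label(text):
    """Rule 1: return whatever follows a 'Name :' label, or None."""
    match = LABEL_RE.search(text)
    return match.group(1).strip() if match else None

def clean_local_part(local_part):
    """Strip digits and turn separators into spaces: 'john.mayer89' -> 'john mayer'."""
    no_digits = re.sub(r"\d+", "", local_part)
    return re.sub(r"[._\-+]+", " ", no_digits).strip().lower()

def name_from_email(text, candidates):
    """Rule 2: pick the candidate phrase most similar to the cleaned email name."""
    match = EMAIL_RE.search(text)
    if not match or not candidates:
        return None
    target = clean_local_part(match.group(1))
    return max(candidates,
               key=lambda c: SequenceMatcher(None, target, c.lower()).ratio())

# Hypothetical candidate phrases taken from the rest of a resume.
phrases = ["Software Engineer", "John Mayer", "New York"]
print(name_from_label("Name : Ankit Solanki "))
print(name_from_email("Contact: john.mayer89@abc.com", phrases))
```

In practice you would probably also want a minimum similarity threshold, so that an email address like info@ or a nickname does not force a bad match.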

Let me know your thoughts on this.

Best,

Ankit
