How can an NER model find person names inside a resume/CV?


Question

I have just started with Stanford CoreNLP and would like to build a custom NER model to find persons.

Unfortunately, I have not found a good NER model for Italian. I need to find these entities inside resume/CV documents.

The problem is that such documents can have different structures. For example, I can have:

Case 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(many different labels can introduce the person entity I need to extract)

Case 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or free text whose context I must use to find these entities.

What is the best approach for this kind of document? Can a maxent model work in this case?

At the moment, my strategy is to look for patterns that have something on the left and something on the right of the entity. With this method I find the entity in about 80-85% of cases.

Example:

Name: John
Birthdate: 2000-01-01

This means I have "Name:" on the left of the pattern and a \n on the right (the match runs until it finds the \n). I can create a very long list of such patterns. I chose patterns because I do not need names that appear in "other" contexts.

For example, if the user mentions other names inside a job experience, I do not need them, because I am looking for the person's own name, not anyone else's. With this method I can reduce false positives, because I look at specific patterns rather than at "general names".

A problem with this method is that I have a big list of patterns (1 pattern = 1 regex), so it does not scale well as I add more.
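One way to soften the "1 pattern = 1 regex" problem is to keep the labels as data and compile them into a single expression per field. The following is only a minimal sketch under assumptions: the label lists (including the Italian ones) and the function names are hypothetical, and the right-hand boundary is simply the end of the line, as in the example above.

```python
import re

# Hypothetical label lists: extend these instead of writing one regex per label.
FULL_NAME_LABELS = ["Full name", "Nome e cognome"]
FIRST_NAME_LABELS = ["Name", "First name", "Nome"]
LAST_NAME_LABELS = ["Surname", "Last name", "Cognome"]


def compile_label_pattern(labels):
    """Build one regex: line start, optional bullet, a label, ':', value up to end of line."""
    # Longest labels first so a longer label is tried before a shorter overlapping one.
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in ordered)
    return re.compile(
        rf"^[ \t]*(?:[-*•][ \t]*)?(?:{alternation})[ \t]*:[ \t]*(?P<value>[^\n]+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


FULL_NAME = compile_label_pattern(FULL_NAME_LABELS)
FIRST_NAME = compile_label_pattern(FIRST_NAME_LABELS)
LAST_NAME = compile_label_pattern(LAST_NAME_LABELS)


def extract_person_name(text):
    """Return the labelled name found in text, or None if no label matches."""
    full = FULL_NAME.search(text)
    if full:
        return full.group("value")
    # Otherwise combine whichever of first name / last name is present.
    parts = [m.group("value") for m in (FIRST_NAME.search(text), LAST_NAME.search(text)) if m]
    return " ".join(parts) if parts else None
```

Because the label must sit at the start of a line, a name mentioned in running text (for instance inside a job description) is not picked up, which keeps the low false-positive behaviour described above.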

It would be great if I could train an NER model with all those patterns, but I would need tons of documents to train it well.

Recommended answer

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case 2.
Stanford NLP provides an excellent English name recognizer, but it may not find all the person names. OpenNLP also gives decent performance, though noticeably lower than Stanford's. There are many other entity recognizers available for English. I will focus on Stanford NLP here; these are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy-match option that, when set, allows partial matches with the gazette entries. Partial matches should work well with person names.
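To make the difference between exact and sloppy gazette matching concrete, here is a toy illustration in Python. It is not Stanford's implementation: in Stanford NER the gazette only contributes features to the classifier, whereas this sketch just shows which tokens a gazette entry would fire on. The function name and data layout are assumptions made for the example. As far as I know, the corresponding Stanford NER feature flags are `useGazettes`, `gazette`, `cleanGazette` and `sloppyGazette`; check the NERFeatureFactory documentation before relying on them.

```python
def gazette_labels(tokens, gazette, sloppy=False):
    """Return, for each token, the set of gazette classes that fire on it.

    gazette is a list of (class_name, entry_tokens) pairs, e.g.
    ("PERSON", ["John", "Travolta"]).
    """
    lowered = [token.lower() for token in tokens]
    labels = [set() for _ in tokens]
    for class_name, entry in gazette:
        entry = [word.lower() for word in entry]
        if sloppy:
            # Partial match: any single token of the entry is enough.
            for i, token in enumerate(lowered):
                if token in entry:
                    labels[i].add(class_name)
        else:
            # Exact match: the whole entry must appear as a contiguous run.
            n = len(entry)
            for start in range(len(lowered) - n + 1):
                if lowered[start:start + n] == entry:
                    for i in range(start, start + n):
                        labels[i].add(class_name)
    return labels
```

With the entry `("PERSON", ["John", "Travolta"])`, the sentence "I worked with Travolta" fires nothing under exact matching, while sloppy matching fires on "Travolta" alone, which is why partial matching suits person names that often appear as a surname only.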

Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, then "Travolta" on its own will also be recognized in the same document, even if the model had no prior idea about "Travolta". So, append as much information to the document as possible: if the rules used in case 1 recognize "John Travolta", add that name in a familiar context such as "My name is John Travolta." Adding such dummy sentences can improve recall.
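The dummy-sentence idea can be sketched as a small preprocessing step that runs before the document is handed to the recognizer. The function name and the default template are assumptions for illustration; for Italian CVs the template would need an Italian sentence.

```python
def append_dummy_sentences(document, names, template="My name is {name}."):
    """Append one familiar-context sentence per distinct name found by the case-1 rules."""
    seen = set()
    sentences = []
    for name in names:
        cleaned = " ".join(name.split())  # collapse stray whitespace
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            sentences.append(template.format(name=cleaned))
    if not sentences:
        return document
    # Keep the original text untouched and add the sentences as a final block.
    return document.rstrip("\n") + "\n\n" + "\n".join(sentences) + "\n"
```

The appended sentences exist only to give the recognizer a clear context for the name, so any entity found inside them should be mapped back to the original text rather than reported as a separate mention.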

Building an annotated training set is a very costly and tedious process; you would need to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even a model trained on annotated training data would not perform any better than the two steps above.

@edit

Since the asker is interested in unsupervised pattern-based approaches, I am expanding my answer to discuss them.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of interest (such as a list of books) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is software that uses the technique described above and is available for download and use.
  • Sonal Gupta received her Ph.D. on this topic; her dissertation is available here.
  • For a light introduction to this topic, see these slides.
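The bootstrapped pattern-learning loop described above can be sketched in a few lines. This is a deliberately simplified toy, not how SPIED works: a "pattern" here is just the two lowercased tokens to the left of a known entity, candidates are runs of capitalized tokens, and there is no pattern or entity scoring, which real systems use to keep the learned patterns from drifting. All function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN.findall(text)


def find_sub(tokens, sub):
    """Yield the start indices where the token list sub occurs in tokens."""
    n = len(sub)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sub:
            yield i


def learn_patterns(sentences, entities, width=2):
    """Collect the width tokens found to the left of every known entity mention."""
    patterns = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for entity in entities:
            for i in find_sub(tokens, tokenize(entity)):
                if i >= width:
                    patterns.add(tuple(t.lower() for t in tokens[i - width:i]))
    return patterns


def apply_patterns(sentences, patterns, width=2):
    """Extract the capitalized token run that follows any learned left context."""
    found = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(width, len(tokens)):
            if tuple(t.lower() for t in tokens[i - width:i]) in patterns:
                j = i
                while j < len(tokens) and tokens[j][:1].isupper():
                    j += 1
                if j > i:
                    found.add(" ".join(tokens[i:j]))
    return found


def bootstrap(sentences, seeds, rounds=2, width=2):
    """Alternate between learning patterns from entities and entities from patterns."""
    entities = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(sentences, entities, width)
        new_entities = apply_patterns(sentences, patterns, width) - entities
        if not new_entities:
            break
        entities |= new_entities
    return entities
```

Starting from the single seed "John Travolta", the loop learns the left context "name is" and then picks up any other name introduced the same way, while names in contexts it has never seen with a seed are left alone.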

