NER model to recognize Indian names


Problem description

I plan to use named entity recognition (NER) to identify person names, most of them Indian, in a given text. I have already tried the CRF-based NER model from Stanford NLP, but it is not very accurate at recognizing Indian names, so I decided to train my own custom NER model with supervised learning. I have a fair idea of how to build a model with the Stanford NER CRF classifier, but I would like to avoid manually annotating a large training corpus: it is a huge effort for one person, and collecting a diverse set of names from different states of India is a challenge in itself. Can anybody suggest an automated or programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into the Facebook and LinkedIn APIs, but did not find a way to extract the full names of 100k users from a given location (e.g. India).

Recommended answer

I ended up doing the following to create an NER model that identifies Indian names. It may be useful to anybody who wants to build a custom NER model for non-English person names, since most publicly available NER models, such as the ones from Stanford NLP, were trained on English names and are therefore more accurate at identifying English (British/American) names.

  1. Find an Indian celebrity with a Twitter account and a very large number of followers (in my case, I chose Sachin Tendulkar).
  2. Write a program in the language of your choice that calls the Twitter REST API (GET followers/list) to fetch the names of all of the celebrity's followers and save them to a file. We can reasonably assume that most of the followers are Indian. Note that an API rate limit applies (30 requests per 15-minute window), so the program must be built to handle it. In our case, we implemented the program as a Windows Service that runs every 15 minutes.
  3. Since some Twitter users' display names are not valid person names, it is advisable to add some rule-based logic (such as a regular expression) that keeps only the names that look real and writes only those to the file.
  4. Once the file of real names has been generated, write another program that creates the training data file, with these names labelled PERSON and non-entity tokens labelled OTHER. If you are using the Stanford NER CRF classifier, the program should generate a tab-separated (TSV) training file with two columns: the first contains the word (token) and the second contains its label.
  5. Once the training corpus has been generated programmatically, follow the link below to train your custom NER model to recognize Indian names: http://nlp.stanford.edu/software/crf-faq.shtml#a
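
The paging and rate-limit handling described in step 2 could be sketched roughly as follows. This is only an illustration, not the program used in the answer: `fetch_page` is a hypothetical callable that you would implement yourself to perform one GET followers/list request for a given cursor and return the decoded JSON (a dict with `users` and `next_cursor`). The sketch sleeps for a full window once the request budget is spent instead of running as a scheduled service, and the budget of 30 is simply the figure quoted above.

```python
import time
from typing import Callable, Dict, Iterator


def iter_follower_names(
    fetch_page: Callable[[int], Dict],   # hypothetical: one API request per call
    requests_per_window: int = 30,       # figure quoted in step 2 (assumption)
    window_seconds: int = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield follower display names page by page, pausing when the budget is used up."""
    cursor = -1          # cursor-based paging starts at -1
    used = 0             # requests made in the current window
    while cursor != 0:   # a next_cursor of 0 means there are no more pages
        if used == requests_per_window:
            sleep(window_seconds)   # wait for the next rate-limit window
            used = 0
        page = fetch_page(cursor)
        used += 1
        for user in page["users"]:
            yield user["name"]
        cursor = page["next_cursor"]
```

Passing `sleep` as a parameter keeps the loop testable without waiting; in real use you would leave the default and append each yielded name to the output file.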
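
Steps 3 and 4 might look something like the sketch below. The regular expression is an assumed heuristic (two to four capitalized ASCII words), not the rule the answer actually used, and it would reject names written in non-Latin scripts. The helper names and the idea of interleaving each name with a few filler tokens are also illustrative assumptions. The background label defaults to OTHER as described above; the example in the linked FAQ uses O instead, so adjust `other_label` to match your classifier configuration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed heuristic: 2-4 words, each a capital letter followed by lowercase letters.
NAME_RE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+){1,3}")


def is_plausible_name(display_name: str) -> bool:
    """Return True if the display name looks like a real person name."""
    return NAME_RE.fullmatch(display_name.strip()) is not None


def build_training_rows(
    names: Iterable[str],
    filler_tokens: List[str],
    person_label: str = "PERSON",
    other_label: str = "OTHER",
) -> List[Tuple[str, str]]:
    """Emit (token, label) rows: filler tokens as background, then each name token."""
    rows: List[Tuple[str, str]] = []
    for name in names:
        rows.extend((tok, other_label) for tok in filler_tokens)
        rows.extend((tok, person_label) for tok in name.split())
    return rows


def rows_to_tsv(rows: Iterable[Tuple[str, str]]) -> str:
    """Serialize rows as two tab-separated columns, one token per line."""
    return "".join(f"{tok}\t{label}\n" for tok, label in rows)
```

For example, `rows_to_tsv(build_training_rows(filter(is_plausible_name, raw_names), ["met"]))` would produce the contents of a two-column training file, where `raw_names` is the list of display names collected in step 2.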
