NLTK for Named Entity Recognition


Question

I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and wrote this quick snippet to test it out:

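import nltk  # the tokenizer, tagger and chunker models must be installed via nltk.download()

sentence = "Let's meet tomorrow at 9 pm"
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)  # attach a part-of-speech tag to each token
print(nltk.ne_chunk(pos_tags, binary=True))  # chunk named entities and print the tree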

I assumed it would identify the date (tomorrow) and the time (9 pm), but surprisingly it failed to recognize either. I get the following result when I run the code above:

(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)

Can someone help me understand whether I am missing something, or whether NLTK is just not mature enough to tag time and date properly? Thanks!

Recommended answer

The default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.

Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/, where the whole process is explained very well.

Also, there is a module called timex in nltk_contrib which might help with your needs: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
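If training a classifier is more than you need, a rule-based pass may be enough for simple messages like the one in the question. The sketch below is only an illustration of that idea using Python's standard re module. It does not use or reproduce the timex module's API, and the patterns and the function name find_temporal_expressions are hypothetical, covering just a handful of relative-day words and clock times.

```python
import re

# Hypothetical patterns for illustration only; this is not the nltk_contrib timex API.
RELATIVE_DAY = r"\b(?:today|tomorrow|tonight|yesterday)\b"
CLOCK_TIME = r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b"

# Each alternative is a named group, so the group name doubles as the label.
TEMPORAL = re.compile(
    rf"(?P<DATE>{RELATIVE_DAY})|(?P<TIME>{CLOCK_TIME})",
    re.IGNORECASE,
)


def find_temporal_expressions(text):
    """Return (label, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group(0)) for m in TEMPORAL.finditer(text)]


print(find_temporal_expressions("Let's meet tomorrow at 9 pm"))
# prints [('DATE', 'tomorrow'), ('TIME', '9 pm')]
```

Anything outside these two patterns (weekday names, "next week", 24-hour times, and so on) is missed, so treat this as a starting point rather than a replacement for a trained chunker.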
