Stanford Named Entity Recognizer (NER) functionality with NLTK

Question

Is it possible to get functionality similar to the Stanford Named Entity Recognizer using just NLTK?

Are there any examples?

In particular, I am interested in extracting the LOCATION part of a text. For example, from the text

The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18

ideally I would like to get something like

(S  
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION

.....

or simply

22 West Westin st., South Carolina, 12345

Instead, I am only able to get

(S
  The/DT
  meeting/NN
  will/MD
  be/VB
  held/VBN
  at/IN
  22/CD
  (LOCATION West/NNP Westin/NNP)
  st./NNP
  ,/,
  (GPE South/NNP Carolina/NNP)
  ,/,
  12345/CD
  on/IN
  Nov.-18/-NONE-)

Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process, the results are far from perfect (the street number and zip code are still missing), but at least "st." is part of a LOCATION, and South Carolina is a LOCATION rather than some "GPE / NNP".

What am I doing wrong? How can I fix it so that I can use NLTK to extract the location piece from some text?

Many thanks!

Recommended answer

NLTK does have an interface for Stanford NER; check nltk.tag.stanford.NERTagger.

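# Newer NLTK releases name this class StanfordNERTagger.
from nltk.tag.stanford import NERTagger
# Arguments: path to the classifier model, then path to the Stanford NER jar.
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
# tag() expects a list of tokens, hence the split().
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())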

Output:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
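
The tagger returns one tag per token, so a multi-word entity such as "Stony Brook University" comes back as three separate pairs. Here is a minimal sketch of merging consecutive tokens that share a tag, using only the standard library. The helper name group_entities is hypothetical and not part of NLTK, and two adjacent entities of the same type would be merged into a single span:

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of consecutive tokens that share a non-'O' tag.

    Hypothetical helper, not part of NLTK. Two distinct entities of the
    same type that sit next to each other are merged into one span.
    """
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities

# The tagger output shown above, copied as literal data.
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

entities = group_entities(tagged)
# Keep only the spans labelled LOCATION, which is what the question asks for.
locations = [text for text, tag in entities if tag == 'LOCATION']
print(entities)
print(locations)
```

Filtering on the tag after grouping keeps the helper general, so the same function can also return PERSON or ORGANIZATION spans.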

However, every time you call tag, NLTK simply writes the target sentence to a file, runs the Stanford NER command-line tool on that file, and parses the output back into Python. The overhead of loading the classifiers (around 1 minute for me every time) is therefore unavoidable.

If that's a problem, use Pyner.

First, run Stanford NER as a server:

java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191

Then go to the pyner folder:

import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
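
To get only the location piece asked about in the question, the result can be filtered by label. This is a minimal sketch that uses the two results shown above as literal data, so it runs without a server. It assumes that get_entities returns a plain dict and json_entities a JSON string, as the comments above suggest, and that a label with no matches is simply absent:

```python
import json

# Literal copies of the results shown above, so no server is needed here.
entities = {'LOCATION': ['California', 'United States'],
            'ORGANIZATION': ['University of California']}
entities_json = '{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'

# .get() with a default avoids a KeyError when no location was found.
locations = entities.get('LOCATION', [])
locations_from_json = json.loads(entities_json).get('LOCATION', [])

print(locations)
print(locations_from_json)
```

The second sentence contains no location, so the lookup falls back to an empty list instead of raising an error.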

Hope this helps.
