Stanford Named Entity Recognizer (NER) functionality with NLTK
Question
Is it possible to get (something similar to) Stanford Named Entity Recognizer functionality using just NLTK?
Are there any examples?
In particular, I am interested in extracting the LOCATION part of the text. For example, from the text
The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18
ideally I would like to get something like
(S
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION
.....
or simply
22 West Westin st., South Carolina, 12345
Instead, all I can get is
(S
The/DT
meeting/NN
will/MD
be/VB
held/VBN
at/IN
22/CD
(LOCATION West/NNP Westin/NNP)
st./NNP
,/,
(GPE South/NNP Carolina/NNP)
,/,
12345/CD
on/IN
Nov.-18/-NONE-)
Note that if I enter my text at http://nlp.stanford.edu:8080/ner/process I get results that are far from perfect (the street number and zip code are still missing), but at least "st." is part of the LOCATION, and South Carolina is tagged LOCATION rather than some "GPE / NNP".
What am I doing wrong? How can I fix it so that NLTK extracts the location piece from the text?
Thanks a lot!
Answer
NLTK DOES have an interface for Stanford NER; check nltk.tag.stanford.NERTagger (renamed StanfordNERTagger in newer NLTK releases).
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
'/usr/share/stanford-ner/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
Output:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
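Since the tagger returns a flat (token, tag) list, re-assembling multi-token entities takes one more step. A small helper sketch (my own, not part of NLTK) that collapses consecutive tokens sharing the same tag:

```python
from itertools import groupby

def group_entities(tagged_tokens):
    """Collapse runs of consecutive tokens sharing a non-'O' tag
    into entity phrases, keyed by entity type."""
    entities = {}
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "O":          # 'O' marks tokens outside any entity
            continue
        phrase = " ".join(token for token, _ in run)
        entities.setdefault(tag, []).append(phrase)
    return entities

tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
print(group_entities(tagged))
# {'PERSON': ['Rami Eid'], 'ORGANIZATION': ['Stony Brook University'], 'LOCATION': ['NY']}
```

Note this merges two distinct same-type entities that happen to be adjacent; for the asker's address-extraction use case that is usually what you want anyway.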
However, every time you call tag, NLTK simply writes the target sentence to a file, runs the Stanford NER command-line tool to parse that file, and finally parses the output back into Python. The overhead of loading the classifiers (around 1 minute per call, for me) is therefore unavoidable.
If that's a problem, use Pyner.
First, run Stanford NER as a server:
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191
Then, from the pyner folder:
import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Hope this helps.