Extract list of Persons and Organizations using Stanford NER Tagger in NLTK


Question

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

# Note: later NLTK releases renamed this class to StanfordNERTagger.
from nltk.tag.stanford import NERTagger

st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
r = st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)

The output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract all persons and organizations from this list in this form:

Rami Eid
Stony Brook University

I tried to loop through the list of tuples:

# r is the list of (word, tag) tuples returned by st.tag() above
for x, y in r:
    if y == 'ORGANIZATION':
        print(x)

But this code prints each word of an entity on its own line:

Stony
Brook
University

With real data there can be more than one organization or person in a sentence. How can I mark the boundaries between different entities?

Recommended answer

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

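"Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they are not used in the models we currently distribute (as of 2012)."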
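To make the quoted point concrete, here is a minimal sketch of how IOB labels mark entity boundaries. The tagged list and the helper name chunk_iob are made up for illustration; the distributed Stanford models discussed here do not produce labels like these.

```python
def chunk_iob(tagged_words):
    """Group (word, IOB-label) pairs into (entity_type, phrase) chunks."""
    chunks = []
    current_type, current_words = None, []
    for word, label in tagged_words:
        prefix, _, entity_type = label.partition("-")
        # Continue the open chunk only on an I- label of the same type.
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)
            continue
        # Anything else closes the open chunk, if there is one.
        if current_words:
            chunks.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # A B- label (or a stray I- label) starts a new chunk.
            current_type, current_words = entity_type, [word]
        else:
            # An O label means we are outside any entity.
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks


# Hypothetical IOB output: adjacent locations stay separate
# because each one starts with its own B- label.
iob_tagged = [
    ("New", "B-LOC"), ("York", "I-LOC"),
    ("Boston", "B-LOC"),
    ("Baltimore", "B-LOC"),
    ("are", "O"), ("cities", "O"),
]
print(chunk_iob(iob_tagged))
```

With plain PERSON/ORGANIZATION/LOCATION tags there is no B- marker, so the boundary between two neighbouring entities of the same type cannot be recovered from the tags alone.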

You have the following options:

  1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) This is what Alvas's code does in the accepted answer. See below for a simpler implementation.

  2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)

  3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.

  4. Train your own IOB named entity chunker (using the Stanford tools or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.

EDIT: If all you want is to pull out runs of consecutive named entities (option 1 above), you should use itertools.groupby:

from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x:x[1]):
    if tag != "O":
        print("%-12s"%tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY
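Since the question asks for lists of persons and organizations only, the same groupby idea can be wrapped in a small helper. This is a sketch rather than part of the original answer: the function name collect_entities and its wanted parameter are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import groupby


def collect_entities(netagged_words, wanted=("PERSON", "ORGANIZATION")):
    """Map each wanted tag to the list of entity phrases found for it."""
    entities = defaultdict(list)
    # groupby yields one group per run of consecutive words sharing a tag.
    for tag, chunk in groupby(netagged_words, key=lambda pair: pair[1]):
        if tag in wanted:
            entities[tag].append(" ".join(word for word, _ in chunk))
    return dict(entities)


# The tagger output shown in the question.
netagged_words = [
    ("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
    ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
    ("University", "ORGANIZATION"), ("in", "O"), ("NY", "LOCATION"),
]
print(collect_entities(netagged_words))
# {'PERSON': ['Rami Eid'], 'ORGANIZATION': ['Stony Brook University']}
```

Keeping a list per tag means a sentence with several persons or organizations yields one entry per run instead of overwriting earlier matches.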

Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g., New York, Boston [and] Baltimore is about three cities, not one.
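The merging behaviour is easy to reproduce. In the sketch below the tags are hand-written for illustration and are not actual tagger output; the point is only that two neighbouring LOCATION words end up in a single run.

```python
from itertools import groupby

# Hand-written tags for illustration, not actual tagger output.
netagged_words = [
    ("She", "O"), ("visited", "O"),
    ("Boston", "LOCATION"), ("Baltimore", "LOCATION"),
    ("last", "O"), ("year", "O"),
]

# Same run-collecting logic as above, gathered into a list.
runs = [
    (tag, " ".join(word for word, _ in chunk))
    for tag, chunk in groupby(netagged_words, key=lambda pair: pair[1])
    if tag != "O"
]
print(runs)
# [('LOCATION', 'Boston Baltimore')]
```

Two cities come back as one entity, which is the cost of grouping by tag alone.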
