How to recognize a named entity that is lowercase, such as kobe bryant, with CoreNLP?


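Problem description

I have a problem: CoreNLP only recognizes named entities that start with an uppercase character, such as Kobe Bryant, but it does not recognize kobe bryant as a person. How can I get CoreNLP to recognize a named entity that starts with a lowercase character? Thanks!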

Solution

First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one reason why Chinese NER is harder than English NER.) Nevertheless, there are things that you must do to get CoreNLP working fairly well with lowercase text – the default models are trained to work well on well-edited text.

If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a long list of models given to the ner.model property).

Approach 1: Caseless models. We also provide English models that ignore case information. They will work much better on all lowercase text.

Approach 2: Use the truecaser. We provide a truecase annotator, which attempts to convert text into formally edited capitalization. You can apply it first, and then use the regular annotators.

In general, it's not clear to us that one of these approaches usually or always wins. You can try both.

Important: To have available the extra components invoked below, you need to have downloaded the English models jar, and to have it available on your classpath.

Here's an example. We start with a sample text:

% cat lakers.txt
lonzo ball talked about kobe bryant after the lakers game.

With the default models, no entities are found, and all the name words just get a common noun tag. Sad!

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
% cat lakers.txt.conll 
1   lonzo   lonzo   NN  O   _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NN  O   _   _
6   bryant  bryant  NN  O   _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   lakers  laker   NNS O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _
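Each line of this CoNLL-style output holds one token, with columns for the token index, word, lemma, POS tag and NER tag (the last two columns are unused in this run). As a rough sketch of how such output could be post-processed, the hypothetical helper below (`ConllEntities` is not part of CoreNLP) groups consecutive tokens that share a non-O NER tag into entity spans. It assumes whitespace-separated columns in the order shown above, and it will merge two adjacent entities that happen to carry the same label. The sample lines in `main` are taken from the caseless run later in this answer, since the default run finds no entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConllEntities {

    /** Groups consecutive tokens sharing a non-"O" NER tag into "text/LABEL" strings. */
    public static List<String> extract(List<String> lines) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed column order: index, word, lemma, POS, NER, ...
            // Short or blank lines are treated as "O" so they close any open span.
            String word = cols.length >= 5 ? cols[1] : "";
            String label = cols.length >= 5 ? cols[4] : "O";
            // A change of label closes the span collected so far.
            if (!label.equals(currentLabel) && current.length() > 0) {
                entities.add(current + "/" + currentLabel);
                current.setLength(0);
            }
            if (!label.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
            }
            currentLabel = label;
        }
        // Flush a span that runs to the end of the input.
        if (current.length() > 0) {
            entities.add(current + "/" + currentLabel);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1 lonzo lonzo NNP PERSON _ _",
            "2 ball ball NNP PERSON _ _",
            "3 talked talk VBD O _ _",
            "4 about about IN O _ _",
            "5 kobe kobe NNP PERSON _ _",
            "6 bryant bryant NNP PERSON _ _");
        System.out.println(extract(sample));
    }
}
```

For the six sample lines, this prints `[lonzo ball/PERSON, kobe bryant/PERSON]`.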

Below, we ask to use the caseless models, and then we're doing pretty well: All the name words are now recognized as proper nouns, and the two person names are recognized. But the team name is still missed.

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
% cat lakers.txt.conll 
1   lonzo   lonzo   NNP PERSON  _   _
2   ball    ball    NNP PERSON  _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NNP PERSON  _   _
6   bryant  bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   lakers  lakers  NNPS    O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _
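If you drive CoreNLP from Java rather than the command line, the same settings can be expressed as a `Properties` object. The following is only a sketch: `CaselessConfig` is a hypothetical class name, the property names and model paths are copied from the command above, and the snippet deliberately stops short of constructing the pipeline so that it runs without the CoreNLP jars.

```java
import java.util.Properties;

public class CaselessConfig {

    /** Builds the same settings as the caseless command line above. */
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("pos.model",
            "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger");
        // ner.model takes a comma-separated list of models.
        props.setProperty("ner.model", String.join(",",
            "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz",
            "edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz",
            "edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz"));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP and the English models jar on the classpath, these
        // properties would be passed to
        // new edu.stanford.nlp.pipeline.StanfordCoreNLP(props).
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```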

Instead, you can run truecasing prior to POS tagging and NER:

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
% cat lakers.txt.conll 
1   Lonzo   Lonzo   NNP PERSON  _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   Kobe    Kobe    NNP PERSON  _   _
6   Bryant  Bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   Lakers  Lakers  NNPS    ORGANIZATION    _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _

Now, the organization Lakers is recognized, and in general nearly all the entity words are tagged as proper nouns with the correct entity label, but it fails to get ball, which remains a common noun. Of course, this is a fairly hard word to get right in caseless text, since ball is a quite frequent common noun.
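The truecasing setup can likewise be written as a `Properties` object when working from Java. Again, this is a sketch under assumptions: `TruecaseConfig` is a hypothetical class name, and the bare `-truecase.overwriteText` flag from the command line is assumed to correspond to setting that property to `true`.

```java
import java.util.Properties;

public class TruecaseConfig {

    /** Builds the same settings as the truecase command line above. */
    public static Properties build() {
        Properties props = new Properties();
        // truecase runs before pos and ner so that they see the recased text.
        props.setProperty("annotators", "tokenize,ssplit,truecase,pos,lemma,ner");
        // Assumption: the bare command-line flag is equivalent to "true".
        props.setProperty("truecase.overwriteText", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP and the English models jar on the classpath, these
        // properties would be passed to
        // new edu.stanford.nlp.pipeline.StanfordCoreNLP(props).
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```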
