Is it possible to get the set of tokens of a specific named entity that comprise a phrase?

Question

I'm using the Stanford CoreNLP parsers to run through some text that contains date phrases, such as 'the second Monday in October' and 'the past year'. The library correctly tags each token as a DATE named entity, but is there a way to get the whole date phrase programmatically? This isn't limited to dates: ORGANIZATION named entities behave the same way ("The International Olympic Committee", for example, could be one identified in a given text).

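import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

String content = "Thanksgiving, or Thanksgiving Day (Canadian French: Jour de"
        + " l'Action de grâce), occurring on the second Monday in October, is"
        + " an annual Canadian holiday which celebrates the harvest and other"
        + " blessings of the past year.";

// Build a pipeline that runs NER (and the annotators it depends on).
Properties p = new Properties();
p.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(p);

Annotation document = new Annotation(content);
pipeline.annotate(document);

// Print every token that was tagged as part of a DATE entity.
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

        // Constant-first comparison avoids a NullPointerException if the tag is missing.
        if ("DATE".equals(ne)) {
            System.out.println("DATE: " + word);
        }

    }
}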

Once the Stanford annotators and classifiers have loaded, this yields the following output:

DATE: Thanksgiving
DATE: Thanksgiving
DATE: the
DATE: second
DATE: Monday
DATE: in
DATE: October
DATE: the
DATE: past
DATE: year

I feel like the library must be recognizing the phrases in order to tag the named entities, so the question is: is that data kept and made available somehow through the API?

Thanks, Kevin

Recommended Answer

After discussions on the mailing list, I found that the API does not support this. My solution was to keep track of the previous token's NE tag and build up a string when necessary. John B. from the NLP mailing list was helpful in answering my question.
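
The answer gives no code, so the following is only a minimal sketch of what "keep the state of the last NE and build a string" might look like, not the author's actual implementation. To keep it self-contained it works on two plain parallel lists rather than on CoreNLP objects. The class name `EntityPhraseGrouper` and the method `group` are hypothetical. It assumes that tokens outside any entity carry the tag "O", and that the words and tags would be read from each token's `TextAnnotation` and `NamedEntityTagAnnotation` as in the question's loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges runs of consecutive tokens that share an NE tag. */
final class EntityPhraseGrouper {

    /** Assumed tag for tokens that are not part of any named entity. */
    private static final String OUTSIDE = "O";

    /**
     * Returns one "TAG: phrase" string per run of adjacent tokens with the same
     * non-"O" tag. words and tags must be parallel lists of equal length.
     */
    static List<String> group(List<String> words, List<String> tags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String lastTag = OUTSIDE; // state carried over from the previous token

        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i) == null ? OUTSIDE : tags.get(i);

            // A change of tag ends the phrase that was being built.
            if (!tag.equals(lastTag)) {
                flush(phrases, lastTag, current);
                lastTag = tag;
            }

            // Only tokens inside an entity contribute to a phrase.
            if (!OUTSIDE.equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        flush(phrases, lastTag, current); // emit a phrase that runs to the end
        return phrases;
    }

    /** Emits the buffered phrase, if any, and clears the buffer. */
    private static void flush(List<String> phrases, String tag, StringBuilder current) {
        if (current.length() > 0) {
            phrases.add(tag + ": " + current);
            current.setLength(0);
        }
    }
}
```

Calling `group` with the words and tags collected for one sentence returns one entry per entity phrase instead of one per token. A limitation of this approach is that two distinct entities of the same type with no other token between them would be merged into a single phrase, because the tag alone does not mark where one entity ends and the next begins.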
