Manual tagging of words using Stanford CoreNLP


Problem description

I have a resource for which I know the exact type (part of speech) of every word. I need to lemmatize these words, but to get correct results I have to tag them manually, and I could not find any code for manually tagging words. I am using the following code, but it returns the wrong result: it gives "painting" for "painting", where I expect "paint".

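import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

//...........lemmatization starts........................

// The "pos" annotator tags each token automatically; "lemma" then uses that tag.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Annotation document = pipeline.process(text);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        // The lemma depends on the automatically assigned POS tag.
        String lemma = token.get(LemmaAnnotation.class);
        System.out.println("lemmatized version :" + lemma);
    }
}

//...........lemmatization ends.........................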

I need to run the lemmatizer on individual words, not on sentences where POS tagging is done automatically. So I would first tag the words manually and then find their lemmas. Sample code or a pointer to a relevant site would be a great help.

Recommended answer

If you know the POS tags in advance, you can get the lemmas in the following way:

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";

Morphology morphology = new Morphology();

Annotation document = pipeline.process(text);  

List<edu.stanford.nlp.util.CoreMap> sentences = document.get(SentencesAnnotation.class);

for(edu.stanford.nlp.util.CoreMap sentence: sentences) {

  for(CoreLabel token: sentence.get(TokensAnnotation.class)) {       
    String word = token.get(TextAnnotation.class);
    String tag = ... //get the tag for the current word from somewhere, e.g. an array
    String lemma = morphology.lemma(word, tag);
    System.out.println("lemmatized version :" + lemma);
  }
}
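The "get the tag from somewhere" step above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not part of CoreNLP: the class name `ManualTagLemmatizer`, the word-to-tag map, and the pluggable lemma function are all hypothetical. The lemma function is passed in so the sketch runs without CoreNLP on the classpath; with CoreNLP available, one would presumably pass a reference to the `morphology.lemma(word, tag)` call shown above instead of the toy stand-in used in `main`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper: pairs each word with a manually assigned POS tag
// and delegates the actual lemmatization to a supplied function.
class ManualTagLemmatizer {
    private final Map<String, String> tagByWord;
    private final BiFunction<String, String, String> lemmaFn;

    ManualTagLemmatizer(Map<String, String> tagByWord,
                        BiFunction<String, String, String> lemmaFn) {
        this.tagByWord = tagByWord;
        this.lemmaFn = lemmaFn;
    }

    // Looks up the manual tag for every word and applies the lemma function.
    List<String> lemmatize(List<String> words) {
        List<String> lemmas = new ArrayList<>();
        for (String word : words) {
            String tag = tagByWord.get(word);
            if (tag == null) {
                throw new IllegalArgumentException("No manual tag for word: " + word);
            }
            lemmas.add(lemmaFn.apply(word, tag));
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // Manually assigned Penn Treebank tags (illustrative data).
        Map<String, String> tags = new HashMap<>();
        tags.put("painting", "VBG");
        tags.put("building", "NN");

        // Toy stand-in for a real lemmatizer: strips "ing" from VBG words only.
        // Assumption: in real use this would be replaced by a CoreNLP call.
        BiFunction<String, String, String> toyLemma = (word, tag) ->
            tag.equals("VBG") && word.endsWith("ing")
                ? word.substring(0, word.length() - 3)
                : word;

        ManualTagLemmatizer lemmatizer = new ManualTagLemmatizer(tags, toyLemma);
        System.out.println(lemmatizer.lemmatize(Arrays.asList("painting", "building")));
    }
}
```

Keying the map by word alone assumes each word has a single tag in your resource; if the same word can appear with different tags, a list of (word, tag) pairs would be a better fit.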

If you only want the lemma of a single word, you do not even need to run CoreNLP for tokenization and sentence splitting; you can call the lemma function directly:

String tag = "VBG";      
String word = "painting";
Morphology morphology = new Morphology();
String lemma = morphology.lemma(word, tag);
