Manual tagging of words using Stanford CoreNLP
Question
I have a resource in which I know the exact type (part of speech) of every word. I need to lemmatize these words, but to get correct results I have to tag them manually, and I could not find any code for manually tagging words. I am using the following code, but it returns the wrong result: it gives "painting" for "painting", where I expect "paint".
//...........lemmatization starts........................
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Annotation document = pipeline.process(text);
List<edu.stanford.nlp.util.CoreMap> sentences = document.get(SentencesAnnotation.class);
for (edu.stanford.nlp.util.CoreMap sentence : sentences)
{
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
        String word = token.get(TextAnnotation.class);
        String lemma = token.get(LemmaAnnotation.class);
        System.out.println("lemmatized version :" + lemma);
    }
}
//...........lemmatization ends.........................
I need to run the lemmatizer on individual words, not on sentences where POS tagging would be done automatically. So I would first tag the words manually and then find their lemmas. Sample code or a pointer to a relevant site would be a great help.
Recommended answer
If you know the POS tags in advance, you can get the lemmas the following way:
import edu.stanford.nlp.process.Morphology;

// Only tokenization and sentence splitting are needed; the POS tag is supplied by hand.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Morphology morphology = new Morphology();
Annotation document = pipeline.process(text);
List<edu.stanford.nlp.util.CoreMap> sentences = document.get(SentencesAnnotation.class);
for (edu.stanford.nlp.util.CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String tag = ... //get the tag for the current word from somewhere, e.g. an array
        // Lemmatize with the known tag instead of an automatically predicted one.
        String lemma = morphology.lemma(word, tag);
        System.out.println("lemmatized version :" + lemma);
    }
}
If you only want the lemma of a single word, you don't even have to run CoreNLP for tokenization and sentence splitting; you can call the lemma function directly, as follows:
String tag = "VBG"; // Penn Treebank tag: verb, gerund or present participle
String word = "painting";
Morphology morphology = new Morphology();
String lemma = morphology.lemma(word, tag);
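For a whole resource of words with known tags, the same call can be applied in a loop. The following is a minimal sketch, not the answerer's code: the class name KnownTagLemmatizer, the method lemmatizeAll, and the parallel-array layout are all hypothetical, and toyLemma is only a stand-in so the sketch runs without CoreNLP on the classpath. In real use, the assumption is that you would pass a method reference to the lemma method of a Morphology instance in its place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

class KnownTagLemmatizer {

    // Applies a (word, tag) -> lemma function to parallel arrays of words and their known tags.
    static List<String> lemmatizeAll(String[] words, String[] tags,
                                     BinaryOperator<String> lemmatizer) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("every word needs exactly one tag");
        }
        List<String> lemmas = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            lemmas.add(lemmatizer.apply(words[i], tags[i]));
        }
        return lemmas;
    }

    // Hypothetical stand-in for a real lemmatizer: it only strips "ing" from VBG forms.
    static String toyLemma(String word, String tag) {
        if ("VBG".equals(tag) && word.endsWith("ing") && word.length() > 3) {
            return word.substring(0, word.length() - 3);
        }
        return word;
    }

    public static void main(String[] args) {
        String[] words = {"painting", "painting"};
        String[] tags = {"VBG", "NN"};
        // With CoreNLP available, replace toyLemma with morphology::lemma,
        // where morphology is a Morphology instance.
        System.out.println(lemmatizeAll(words, tags, KnownTagLemmatizer::toyLemma));
    }
}
```

Passing the lemmatizer as a parameter keeps the tag lookup separate from the lemmatization itself, so the same loop works whether the tags come from an array, a file, or a database.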