How to generate custom training data for Stanford relation extraction


Question

I have trained a custom classifier to recognize named entities in the finance domain. I want to generate custom training data in the format shown at http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp

I can mark the custom relations by hand, but I first want to generate the data in a CoNLL-like format with my custom named entities.

I have also tried the parser in the following way, but that does not generate relation training data like Roth and Yih's data mentioned at https://nlp.stanford.edu/software/relationExtractor.html#training.

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt > testsent.tree

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx

The following is the output of the custom NER, run separately with this Python code:

'java -mx2g -cp "*" edu.stanford.nlp.ie.NERClassifierCombiner ' \
                '-ner.model classifiers/custom-model.ser.gz,' \
                'classifiers/english.all.3class.distsim.crf.ser.gz,' \
                'classifiers/english.conll.4class.distsim.crf.ser.gz,' \
                'classifiers/english.muc.7class.distsim.crf.ser.gz ' \
                '-textFile ' + outtxt_sent + ' -outputFormat inlineXML > ' + outtxt + '.ner'

output:

<PERSON>Charles Sinclair</PERSON> <DESG>Chairman</DESG> <ORGANIZATION>-LRB- age 68 -RRB- Charles was appointed a</ORGANIZATION> <DESG>non-executive director</DESG> <ORGANIZATION>in</ORGANIZATION>
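The inlineXML output above can be post-processed with a few lines of Python. This is a hedged sketch (`parse_inline_xml` is a made-up helper, not part of Stanford CoreNLP) showing how to pull the entity spans and their labels back out of such a line:

```python
import re

# Sketch: extract (label, text) pairs from CoreNLP's inlineXML NER output.
def parse_inline_xml(line):
    # Matches <LABEL>text</LABEL>; non-greedy so adjacent entities don't merge.
    return re.findall(r"<([A-Z]+)>(.*?)</\1>", line)

ner_line = ("<PERSON>Charles Sinclair</PERSON> <DESG>Chairman</DESG> "
            "<ORGANIZATION>in</ORGANIZATION>")
print(parse_inline_xml(ner_line))
# → [('PERSON', 'Charles Sinclair'), ('DESG', 'Chairman'), ('ORGANIZATION', 'in')]
```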

So the NER works fine standalone; I also have Java code to test it.
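As an aside, building the Java invocation as a single shell string is easy to get wrong (quoting, path separators, missing commas in the model list). A sketch of the same command assembled as an argument list instead, which could then be handed to `subprocess.run`; the file names here are illustrative:

```python
import subprocess  # used when actually launching the command

outtxt_sent = "input.txt"  # assumed input file name

# Comma-separated model list expected by -ner.model
models = ",".join([
    "classifiers/custom-model.ser.gz",
    "classifiers/english.all.3class.distsim.crf.ser.gz",
    "classifiers/english.conll.4class.distsim.crf.ser.gz",
    "classifiers/english.muc.7class.distsim.crf.ser.gz",
])

cmd = ["java", "-mx2g", "-cp", "*",
       "edu.stanford.nlp.ie.NERClassifierCombiner",
       "-ner.model", models,
       "-textFile", outtxt_sent,
       "-outputFormat", "inlineXML"]

print(" ".join(cmd))
# To run it and capture the NER output:
# subprocess.run(cmd, stdout=open(outtxt_sent + ".ner", "w"))
```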

Here is the detailed code for relation data generation:

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
props.setProperty("ner.model", "classifiers/custom-model.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation = new Annotation("Charles Sinclair Chairman -LRB- age 68 -RRB- Charles was appointed a non-executive director");
pipeline.annotate(annotation);
int sentNum = 0;

// ... rest of the code is the same as yours

output:
0   PERSON  0   O   NNP/NNP Charles/Sinclair    O   O   O
0   PERSON  1   O   NNP Chairman    O   O   O
0   PERSON  2   O   -LRB-/NN/CD/-RRB-/NNP/VBD/VBN/DT    -LRB-/age/68/-RRB-/Charles/was/appointed/a  O   O   O
0   PERSON  3   O   JJ/NN   non-executive/director  O   O   O

O   3   member_of_board // I will modify the relation once the data is generated with proper NER

The NER tagging is OK now with the following model list:

 props.setProperty("ner.model", "classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz");

Custom NER problem solved.

Answer

This link shows an example of the data: http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp

I don't think there is a way to produce this format directly in Stanford CoreNLP.

After you tag the data, you need to loop through the sentences and print out the tokens in that same format, including the part-of-speech tag and the NER tag. It appears most of the columns have an "O" in them.

For each sentence that has a relation, you need to print a line after the sentence in the relation format. For instance, this line indicates the previous sentence has the Live_In relation:

7    0    Live_In
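Once you know the element indices of the two argument mentions within the sentence, emitting that row is trivial. A sketch (the function name is illustrative, not part of any Stanford API): the first two columns are the indices of the two mentions, the third is the relation label.

```python
# Sketch: build one Roth-and-Yih-style relation row for a sentence.
def relation_line(arg1_index, arg2_index, label):
    return "{}\t{}\t{}".format(arg1_index, arg2_index, label)

print(relation_line(7, 0, "Live_In"))
# → 7	0	Live_In
```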

Here is some example code to generate the output for a sentence. You will need to set the pipeline to use your NER model by setting the ner.model property to the path of your custom model. WARNING: there may be some bugs in this code, but it should show how to access the data you need from the StanfordCoreNLP data structures.

package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

import java.util.*;
import java.util.stream.Collectors;

public class CreateRelationData {

  public static void main(String[] args) {
    // set up pipeline properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    // set up Stanford CoreNLP pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // build annotation for a review
    Annotation annotation = new Annotation("Joe Smith lives in Hawaii.");
    pipeline.annotate(annotation);
    int sentNum = 0;
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      int tokenNum = 1;
      int elementNum = 0;
      int entityNum = 0;
      CoreMap currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
      String currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
          collect(Collectors.joining("/"));
      String currEntityMentionTags =
          currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
              collect(Collectors.joining("/"));
      String currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
      while (tokenNum <= sentence.get(CoreAnnotations.TokensAnnotation.class).size()) {
        if (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).get(0).index() == tokenNum) {
          System.out.println(sentNum+"\t"+currEntityMentionNER+"\t"+elementNum+"\t"+"O\t"+currEntityMentionTags+"\t"+
              currEntityMentionWords+"\t"+"O\tO\tO");
          // update tokenNum
          tokenNum += (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).size());
          // update entity if there are remaining entities
          entityNum++;
          if (entityNum < sentence.get(CoreAnnotations.MentionsAnnotation.class).size()) {
            currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
            currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
                collect(Collectors.joining("/"));
            currEntityMentionTags =
                currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
                    collect(Collectors.joining("/"));
            currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
          }
        } else {
          CoreLabel token = sentence.get(CoreAnnotations.TokensAnnotation.class).get(tokenNum-1);
          System.out.println(sentNum+"\t"+token.ner()+"\t"+elementNum+"\tO\t"+token.tag()+"\t"+token.word()+"\t"+"O\tO\tO");
          tokenNum += 1;
        }
        elementNum += 1;
      }
      sentNum++;
    }
    System.out.println();
    System.out.println("O\t3\tLive_In");
  }
}

