Training n-gram NER with Stanford NLP

Question

Recently I have been trying to train n-gram entities with Stanford CoreNLP. I followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I can only specify unigram tokens and the class each one belongs to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please correct me if I have misinterpreted the Stanford tutorial and the same setup can in fact be used for n-gram training.

The part I am stuck on is the following property:

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here the first column is the word (unigram) and the second column is the entity label, for example:

CHAPTER O
I   O
Emma    PERS
Woodhouse   PERS

Training single-word known entities (say, movie names such as Hulk or Titanic) as movies is easy with this approach. But what is the best approach if I need to train multi-word names such as "I know what you did last summer" or "Baby's day out"?

Recommended answer

After a long wait for an answer here, I still have not figured out how to do this with Stanford CoreNLP. However, I got the task done using the LingPipe NLP library instead. I am posting the solution here because I think others could benefit from it.

Please check the LingPipe licensing terms before starting an implementation, whether you are a developer, a researcher, or anyone else.

LingPipe provides various NER methods:

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

I have used both the dictionary-based and the statistical approaches.

The first is a direct lookup method; the second is training-based.
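To make "direct lookup" concrete, here is a minimal plain-Java sketch of the idea: scan the text and, at each position, keep the longest known phrase that starts there. This is only an illustration and does not use LingPipe's dictionary chunker; the class name `DictionaryLookupSketch`, the `Match` type, and the sample phrases are all made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of dictionary lookup; this is NOT LingPipe's API.
final class DictionaryLookupSketch {

    /** One match: character offsets into the input text plus the entity type. */
    static final class Match {
        final int start;
        final int end;
        final String type;

        Match(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Hypothetical dictionary: phrase -> entity type.
    private final Map<String, String> phrases = new LinkedHashMap<>();

    void add(String phrase, String type) {
        if (!phrase.isEmpty()) { // empty phrases would match everywhere
            phrases.put(phrase, type);
        }
    }

    /** Scans left to right and keeps the longest phrase starting at each position. */
    List<Match> find(String text) {
        List<Match> matches = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int bestEnd = -1;
            String bestType = null;
            for (Map.Entry<String, String> entry : phrases.entrySet()) {
                String phrase = entry.getKey();
                int end = i + phrase.length();
                // Case-insensitive comparison, and the match must not sit inside a longer word.
                boolean hit = text.regionMatches(true, i, phrase, 0, phrase.length())
                        && isBoundary(text, i - 1)
                        && isBoundary(text, end);
                if (hit && end > bestEnd) {
                    bestEnd = end;
                    bestType = entry.getValue();
                }
            }
            if (bestType != null) {
                matches.add(new Match(i, bestEnd, bestType));
                i = bestEnd; // continue after the match so spans never overlap
            } else {
                i++;
            }
        }
        return matches;
    }

    // True when the index is outside the text or the character there is not a letter or digit.
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    public static void main(String[] args) {
        DictionaryLookupSketch lookup = new DictionaryLookupSketch();
        lookup.add("Hulk", "MOVIE");
        lookup.add("I know what you did last summer", "MOVIE");
        String text = "have you seen i know what you did last summer or hulk";
        for (Match m : lookup.find(text)) {
            System.out.println(text.substring(m.start, m.end) + " >> " + m.type);
        }
    }
}
```

Because the whole phrase is compared as one string, multi-word names need no special handling here. The trade-off is that only phrases already in the dictionary can be found, which is why a trained model is useful alongside it.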

The statistical approach requires a training file. I used a file in the following format:

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>
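If the entities are already known (for example, a list of movie titles), a file in this format can be generated rather than annotated by hand. The following is a rough plain-Java sketch under that assumption; `EnamexAnnotatorSketch` and its methods are hypothetical helpers, not part of LingPipe. It escapes `&`, `<` and `>` on the assumption that the file is read by an XML parser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for producing the training format above; not part of LingPipe.
final class EnamexAnnotatorSketch {

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // True when the index is outside the text or the character there is not a letter or digit.
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    /** Wraps each case-insensitive occurrence of a known phrase in an ENAMEX tag. */
    static String annotate(String sentence, List<String> phrases, String type) {
        StringBuilder out = new StringBuilder("<s>");
        int i = 0;
        while (i < sentence.length()) {
            String hit = null;
            for (String phrase : phrases) {
                boolean matches = !phrase.isEmpty()
                        && sentence.regionMatches(true, i, phrase, 0, phrase.length())
                        && isBoundary(sentence, i - 1)
                        && isBoundary(sentence, i + phrase.length());
                // Prefer the longest phrase starting at this position.
                if (matches && (hit == null || phrase.length() > hit.length())) {
                    hit = phrase;
                }
            }
            if (hit == null) {
                out.append(escape(String.valueOf(sentence.charAt(i))));
                i++;
            } else {
                // Keep the surface form exactly as it appears in the sentence.
                String surface = sentence.substring(i, i + hit.length());
                out.append("<ENAMEX TYPE=\"").append(type).append("\">")
                   .append(escape(surface)).append("</ENAMEX>");
                i += hit.length();
            }
        }
        return out.append("</s>").toString();
    }

    /** Builds the whole document: one <s> element per sentence inside <root>. */
    static String buildCorpus(List<String> sentences, List<String> phrases, String type) {
        StringBuilder doc = new StringBuilder("<root>\n");
        for (String sentence : sentences) {
            doc.append(annotate(sentence, phrases, type)).append('\n');
        }
        return doc.append("</root>\n").toString();
    }

    public static void main(String[] args) {
        List<String> movies = Arrays.asList("Baby's day out", "Hulk");
        List<String> chat = Arrays.asList(
                "did you watch baby's day out yesterday",
                "Hulk was on tv & I missed it");
        System.out.print(buildCorpus(chat, movies, "myentity"));
    }
}
```

The output would still need to be written to the corpus file with a proper encoding, and a model trained only on dictionary hits will mostly learn what the dictionary already knows, so hand-checked examples remain valuable.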

I then used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;  // maximum character n-gram length
    static final int NUM_CHARS = 300;  // assumed number of distinct characters
    static final double LM_INTERPOLATION = MAX_N_GRAM; // interpolation factor; default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt");// my annotated file
        File modelFile = new File("outputmodelfile.model"); 

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

To test the NER, I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> test = chunking.chunkSet();
        for (Chunk c : test) {
            // Print the input, the matched span, and the entity type.
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code courtesy: Google :)
