Creating training data for a Maxent classifier in Java

Question

I am trying to create a Java implementation of a maxent classifier. I need to classify sentences into n different classes.

I had a look at ColumnDataClassifier in the Stanford maxent classifier, but I cannot work out how to create the training data. I need training data that includes POS tags for the words of each sentence, so that the classifier can use features such as the previous word, the next word, etc.

I am looking for training data that contains sentences with POS tagging and the sentence class, for example:

My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS

Any help would be much appreciated.

Recommended answer

If I understand it correctly, you are trying to treat sentences as a set of POS tags.

In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP). That means every sentence is effectively a vector of length 37: 36 binary features, one for each of the 36 possible POS tags according to this page, plus the CLASS outcome for the whole sentence.

This can be encoded for OpenNLP Maxent as follows:

PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1

or simply:

PRP$ NN VBZ NNP CLASS=SomeClassOfYours1

(For working code-snippet see my answer here: Training models using openNLP maxent)
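The set-of-tags encoding above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `PosSetEncoder` and the method `encode` are hypothetical and not part of OpenNLP, and the POS tags are assumed to come from whatever tagger you already use.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: turns the POS tags of one sentence
// into a line such as "PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1".
class PosSetEncoder {

    // Collapses the tag sequence into its distinct tags (first-seen order),
    // writes each as a binary feature "TAG=1", and appends the outcome.
    static String encode(List<String> posTags, String sentenceClass) {
        Set<String> distinctTags = new LinkedHashSet<>(posTags);
        StringBuilder line = new StringBuilder();
        for (String tag : distinctTags) {
            line.append(tag).append("=1 ");
        }
        return line.append("CLASS=").append(sentenceClass).toString();
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("PRP$", "NN", "VBZ", "NNP");
        System.out.println(encode(tags, "SomeClassOfYours1"));
    }
}
```

Tags that do not occur in a sentence are simply left out, which is equivalent to a 0 entry in the binary vector.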

More sample data would be:


  1. "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
  2. "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
  3. "As soon as she moved out, the mobile home was demolished, the suit said."
  4. ...

This would yield the samples:

IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...

However, I don't expect that such a classification yields good results. It would be better to make use of other structural features of a sentence, such as the parse tree or dependency tree that can be obtained using e.g. Stanford parser.

Edited on 28.3.2016:
You can also use the whole sentence as a training sample. However, be aware that:
- two sentences might contain the same words but have different meanings
- there is a pretty high chance of overfitting
- you should use short sentences
- you need a huge training set

According to your example, I would encode the training samples as follows:

class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...

Notice that the outcome variable comes as the first element on each line.
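As a rough sketch of how such lines could be generated, the following plain-Java helper joins each word with its POS tag and puts the outcome first. The names `TrainingLineBuilder` and `buildLine` are hypothetical, and the words and tags are assumed to be produced by your own tokenizer and tagger.

```java
// Hypothetical helper, not part of OpenNLP: builds one training line in the
// outcome-first layout, e.g. "class=CLASS My_PRP name_NN is_VBZ XYZ_NNP".
class TrainingLineBuilder {

    static String buildLine(String outcome, String[] words, String[] posTags) {
        if (words.length != posTags.length) {
            throw new IllegalArgumentException("Each word needs exactly one POS tag");
        }
        // The outcome is the first token on the line.
        StringBuilder line = new StringBuilder("class=").append(outcome);
        // Each context feature is word_TAG, separated by a single space.
        for (int i = 0; i < words.length; i++) {
            line.append(' ').append(words[i]).append('_').append(posTags[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] words = {"My", "name", "is", "XYZ"};
        String[] tags = {"PRP", "NN", "VBZ", "NNP"};
        System.out.println(buildLine("CLASS", words, tags));
    }
}
```

Because features are whitespace-separated, a word that itself contains whitespace would need to be normalized before it is written out.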

Here is a fully working minimal example using opennlp-maxent-3.0.3.jar:

package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {


    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer( new FileEventStream(trainingFileName)); 
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file; the stream is closed automatically
        MaxentModel loadedMaxentModel;
        try (InputStream decodedInputStream = new GZIPInputStream(new FileInputStream(modelFileName))) {
            DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
            loadedMaxentModel = new GISModelReader(modelReader).getModel();
        }

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);

        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);
        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }

}

And some dummy training data (stored as training-file.txt):

class=Male      My_PRP name_NN is_VBZ John_NNP
class=Male      My_PRP name_NN is_VBZ Peter_NNP
class=Female    My_PRP name_NN is_VBZ Anna_NNP
class=Female    My_PRP name_NN is_VBZ Gaby_NNP

This yields the following output:

Indexing events using cutoff of 0
Computing event counts...  done. 4 events
Indexing...  done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 4
        Number of Outcomes: 2
      Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-2.772588722239781  0.5
  2:  ... loglikelihood=-2.4410105407571203 1.0
      ...
 99:  ... loglikelihood=-0.16111520541752372    1.0
100:  ... loglikelihood=-0.15953272940719138    1.0
=======================================
class=Female
=======================================
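The counts in this log can be cross-checked against the dummy training data with a few lines of plain Java. The sketch below uses a hypothetical class `TrainingDataCheck` that does not call OpenNLP. It assumes that the first token of each line is the outcome and the remaining tokens are predicates, and that training starts from a uniform distribution over the outcomes, which appears to be what the first log-likelihood line reflects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sanity check, independent of OpenNLP.
class TrainingDataCheck {
    final int events;
    final Set<String> outcomes = new HashSet<>();
    final Set<String> predicates = new HashSet<>();

    TrainingDataCheck(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens[0].isEmpty()) {
                continue; // skip blank lines
            }
            outcomes.add(tokens[0]); // first token: outcome
            predicates.addAll(Arrays.asList(tokens).subList(1, tokens.length));
            count++;
        }
        events = count;
    }

    // Log-likelihood of the data if every outcome is equally likely
    // (assumed starting point of training).
    double uniformLogLikelihood() {
        return events * Math.log(1.0 / outcomes.size());
    }

    public static void main(String[] args) {
        TrainingDataCheck check = new TrainingDataCheck(Arrays.asList(
                "class=Male      My_PRP name_NN is_VBZ John_NNP",
                "class=Male      My_PRP name_NN is_VBZ Peter_NNP",
                "class=Female    My_PRP name_NN is_VBZ Anna_NNP",
                "class=Female    My_PRP name_NN is_VBZ Gaby_NNP"));
        System.out.println("events=" + check.events
                + " outcomes=" + check.outcomes.size()
                + " predicates=" + check.predicates.size());
        System.out.println("uniform log-likelihood=" + check.uniformLogLikelihood());
    }
}
```

For the four sample lines this gives 4 events, 2 outcomes and 7 predicates, and a uniform log-likelihood of 4 · ln(0.5) ≈ -2.7726, in line with iteration 1 of the log. The second number on each iteration line (0.5, then 1.0) is presumably the accuracy on the training data.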
