How to conduct OpenNLP training for a custom NameFinder model?


Problem description

I am trying to get entities from a query.

I have a custom NameFinder model.

The queries look like this:

result for roll number 1304510020.
result for roll-number 1304510020.
result for rollnumber 1304510020.
result of rollnumber 1304510020.
result of roll number 1304510020.
result of roll-number 1304510020.
roll number 1304510020 result.
rollnumber 1304510020 result.
roll-number 1304510020 result.
show result of roll number 1304510020.
show result of rollnumber 1304510020.
show result of roll-number 1304510020.
show my result for 1304510020.
result of 1304510020.

This is my training code:

package nlpParser;

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
public class Trainer {
	// training data set
    static String trainingPath = 
    		"C:\\Users\\MujeebulHasan\\Desktop\\Project\\hbtu\\hbtuaiagent\\Source Code\\parser\\training\\";
    
    public static void main(String[] args) throws IOException {

    	String[] entities = new String[]{"rollnumber","result"};
    	String[] pathsOfTraingFile = new String[]{"rollnumber\\rollnumber.train","result\\result.train"};
    	String[] pathsOfTrainedFile = new String[]{"rollnumber\\rollnumber.bin","result\\result.bin"};
    	
    	for(int i = 0; i < entities.length; i++){
    		final int j = i;
		    InputStreamFactory isf = new InputStreamFactory() {
		        public InputStream createInputStream() throws IOException {
		            return new FileInputStream(trainingPath+pathsOfTraingFile[j]);
		        }
		    };
		    Charset charset = Charset.forName("UTF-8");
		    ObjectStream<String> lineStream = new PlainTextByLineStream(isf, charset);
		    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
		    TokenNameFinderModel model;
		    TokenNameFinderFactory nameFinderFactory = new TokenNameFinderFactory();
		    try {
		        model = NameFinderME.train("en", entities[i], sampleStream, TrainingParameters.defaultParams(),
		                nameFinderFactory);
		    } finally {
		        sampleStream.close();
		    }
		    BufferedOutputStream modelOut = null;
		    try {
		        modelOut = new BufferedOutputStream(new FileOutputStream(trainingPath+pathsOfTrainedFile[i]));
		        model.serialize(modelOut);
		    } finally {
		        if (modelOut != null)
		            modelOut.close();
		    }
    	}
    }
}

rollnumber.train

result for roll number <START:rollnumber> 1304510020 <END> .
result for roll-number <START:rollnumber> 1304510020 <END> .
result for rollnumber <START:rollnumber> 1304510020 <END> .
result for roll <START:rollnumber> 1304510020 <END> .
result of rollnumber <START:rollnumber> 1304510020 <END> .
result of roll number <START:rollnumber> 1304510020 <END> .
result of roll-number <START:rollnumber> 1304510020 <END> .
result of roll <START:rollnumber> 1304510020 <END> .
roll number <START:rollnumber> 1304510020 <END> result.
rollnumber <START:rollnumber> 1304510020 <END> result.
roll-number <START:rollnumber> 1304510020 <END> result.
roll <START:rollnumber> 1304510020 <END> result.
show result of roll number <START:rollnumber> 1304510020 <END> .
show result of rollnumber <START:rollnumber> 1304510020 <END> .
show result of roll-number <START:rollnumber> 1304510020 <END> .
show result of roll <START:rollnumber> 1304510020 <END> .
show my result for <START:rollnumber> 1304510020 <END> .
result of <START:rollnumber> 1304510020 <END> .
result for <START:rollnumber> 1304510020 <END> .
what is my result for rollnumber <START:rollnumber> 1304510020 <END> .
what is my result of rollnumber <START:rollnumber> 1304510020 <END> .
what is my result for roll <START:rollnumber> 1304510020 <END> .

result.train

<START:result> result <END> for roll number 1304510020.
<START:result> result <END> for roll-number 1304510020.
<START:result> result <END> for rollnumber 1304510020.
<START:result> result <END> of rollnumber 1304510020.
<START:result> result <END> of roll number 1304510020.
<START:result> result <END> of roll-number 1304510020.
roll number 1304510020 <START:result> result <END> .
rollnumber 1304510020 <START:result> result <END> .
roll-number 1304510020 <START:result> result <END> .
show <START:result> result <END> of roll number 1304510020.
show <START:result> result <END> of rollnumber 1304510020.
show <START:result> result <END> of roll-number 1304510020.
show my <START:result> result <END> for 1304510020.
<START:result> result <END> of 1304510020.

When I test it using this code:

package nlpParser;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class GetEntities {
	public static void main(String[] args) throws IOException {
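		Scanner sc = new Scanner(System.in);
		GetEntities obj = new GetEntities();
		// Read queries until end of input or a blank line.
		// Strings are checked by content (isEmpty), not by reference (!=).
		while (sc.hasNextLine()) {
			String query = sc.nextLine();
			if (query.trim().isEmpty()) {
				break;
			}
			obj.parse(query);
		}
		sc.close();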
	}
	public void parse(String query) throws IOException{
		String[] entities = new String[]{"rollnumber","result"};
		   String[] pathsOfTrainedFile = new String[]{"rollnumber\\rollnumber.bin","result\\result.bin"};
		   
		   for(int i = 0 ; i < entities.length; i++){
			   //Loading the NER model       
			   TokenNameFinderModel model;
			   // try-with-resources closes the model file once it has been read
			   try (InputStream inputStream = new
			   FileInputStream("C:\\Users\\MujeebulHasan\\Desktop\\Project\\hbtu\\hbtuaiagent\\Source Code\\parser\\training\\"+pathsOfTrainedFile[i])) {
				   model = new TokenNameFinderModel(inputStream);
			   }
			   //Instantiating the NameFinder class 
			   NameFinderME nameFinder = new NameFinderME(model); 
	    	   
				   //Finding the names in the sentence 
	    		   System.out.println("Processing query... ");
	    		   System.out.print("Query = "+query);
				   query = query.replace(".", "");
				   String[] sentence = query.split(" ");
				   System.out.println();
				   System.out.println("RESULT :");
				   Span nameSpans[] = nameFinder.find(sentence); 
				   //Printing the spans of the names in the sentence 
				   for(Span s: nameSpans) {
					   System.out.println(s.toString());
					   System.out.println(sentence[s.getStart()]);
				   }
			   }
		   }
}

It gives the following results, which are sometimes wrong:

result of roll number 1304510020
Processing query... 
Query = result of roll number 1304510020
RESULT :
Processing query... 
Query = result of roll number 1304510020
RESULT :
[0..1) result
result
show result for roll number 1304510020
Processing query... 
Query = show result for roll number 1304510020
RESULT :
Processing query... 
Query = show result for roll number 1304510020
RESULT :
[1..2) result
result
result for rollnumber 1304510020
Processing query... 
Query = result for rollnumber 1304510020
RESULT :
[3..4) rollnumber
1304510020
Processing query... 
Query = result for rollnumber 1304510020
RESULT :
[0..1) result
result
result 1304510020
Processing query... 
Query = result 1304510020
RESULT :
Processing query... 
Query = result 1304510020
RESULT :
[0..1) result
result
1304510020 result
Processing query... 
Query = 1304510020 result
RESULT :
Processing query... 
Query = 1304510020 result
RESULT :
[1..2) result
result

Recommended answer

This happens because your training data set is too small. According to the OpenNLP documentation, you need around 15,000 lines of training data to get good results.

If you don't have enough data, you can simply use regular expressions in your case, which is a lot easier than all of this.
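As a rough illustration of the regular-expression route, here is a minimal sketch using only `java.util.regex`. The class name `QueryEntityExtractor` and its methods are hypothetical, and the pattern assumes that a roll number is always a standalone run of exactly 10 digits, as in the sample queries; adjust it if your roll numbers look different.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: extracts the two entities with plain regular expressions.
final class QueryEntityExtractor {

    // Assumption: a roll number is a standalone run of exactly 10 digits.
    private static final Pattern ROLL_NUMBER = Pattern.compile("\\b\\d{10}\\b");

    // The "result" entity is the literal word, matched case-insensitively.
    private static final Pattern RESULT_WORD =
            Pattern.compile("\\bresult\\b", Pattern.CASE_INSENSITIVE);

    // Returns the first 10-digit number in the query, if there is one.
    static Optional<String> findRollNumber(String query) {
        Matcher m = ROLL_NUMBER.matcher(query);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Returns true when the query mentions the word "result".
    static boolean asksForResult(String query) {
        return RESULT_WORD.matcher(query).find();
    }

    public static void main(String[] args) {
        String query = "show result of roll-number 1304510020.";
        System.out.println("rollnumber = " + findRollNumber(query).orElse("(none)"));
        System.out.println("result     = " + asksForResult(query));
    }
}
```

Because the pattern does not depend on the words around the number, it behaves the same for "roll number", "roll-number", "rollnumber", or no prefix at all, which is where the small trained model was inconsistent.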

If you are willing to build a larger training data set, you can follow this, or again use regular expressions to tag your very large corpus.
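Tagging a corpus with regular expressions could look roughly like the sketch below, which rewrites raw queries into the `<START:rollnumber> ... <END>` format used in rollnumber.train. The class `RegexCorpusTagger` is hypothetical, and it again assumes that every standalone 10-digit token is a roll number. In practice you would read the lines from your corpus file and write the tagged lines to the .train file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: turns raw queries into lines in the <START:type> ... <END> training format.
final class RegexCorpusTagger {

    // Assumption: any token made of exactly 10 digits is a roll number.
    private static final Pattern ROLL_NUMBER = Pattern.compile("\\d{10}");

    static String tag(String rawLine) {
        // Split trailing sentence punctuation into its own token so it is not glued to the entity.
        String spaced = rawLine.trim().replaceAll("([.?!])$", " $1");
        List<String> out = new ArrayList<>();
        for (String token : spaced.split("\\s+")) {
            if (ROLL_NUMBER.matcher(token).matches()) {
                out.add("<START:rollnumber> " + token + " <END>");
            } else {
                out.add(token);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "result for roll number 1304510020.",
                "rollnumber 1304510020 result.",
                "show my result for 1304510020.");
        for (String line : corpus) {
            System.out.println(tag(line));
        }
    }
}
```

Keep in mind that a model trained only on regex-tagged data can at best learn to reproduce the regex, so this is mainly useful for bootstrapping a corpus that you then review and extend by hand.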

Hope this helps!
