Concurrent processing using Stanford CoreNLP (3.5.2)


Problem description



I'm facing a concurrency problem in annotating multiple sentences simultaneously. It's unclear to me whether I'm doing something wrong or maybe there is a bug in CoreNLP.

My goal is to annotate sentences with the pipeline "tokenize, ssplit, pos, lemma, ner, parse, dcoref" using several threads running in parallel. Each thread allocates its own instance of StanfordCoreNLP and then uses it for the annotation.

The problem is that at some point an exception is thrown:

java.util.ConcurrentModificationException
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
	at java.util.ArrayList$Itr.next(ArrayList.java:851)
	at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:463)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:201)
	at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:89)
	at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:139)
	at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:89)
	at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)

I'm attaching a sample code of an application that reproduces the problem in about 20 seconds on my Core i3 370M laptop (Win 7 64bit, Java 1.8.0.45 64bit). This app reads an XML file of the Recognizing Textual Entailment (RTE) corpora and then parses all sentences simultaneously using standard Java concurrency classes. The path to a local RTE XML file needs to be given as a command line argument. In my tests I used the publicly available XML file here: http://www.nist.gov/tac/data/RTE/RTE3-DEV-FINAL.tar.gz
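For reference, the corpus file that the JAXB classes below expect has roughly the following shape (the ids and sentences here are invented placeholders; the real RTE3 file may carry additional attributes, which the unmarshaller simply ignores):

<entailment-corpus>
  <pair id="1" entailment="YES">
    <t>The first text sentence of the pair.</t>
    <h>The hypothesis sentence of the pair.</h>
  </pair>
  <pair id="2" entailment="NO">
    <t>Another text sentence.</t>
    <h>Another hypothesis sentence.</h>
  </pair>
</entailment-corpus>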

package semante.parser.stanford.server;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordMultiThreadingTest {

	@XmlRootElement(name = "entailment-corpus")
	@XmlAccessorType (XmlAccessType.FIELD)
	public static class Corpus {
		@XmlElement(name = "pair")
		private List<Pair> pairList = new ArrayList<Pair>();

		public void addPair(Pair p) {pairList.add(p);}
		public List<Pair> getPairList() {return pairList;}
	}

	@XmlRootElement(name="pair")
	public static class Pair {

		@XmlAttribute(name = "id")
		String id;

		@XmlAttribute(name = "entailment")
		String entailment;

		@XmlElement(name = "t")
		String t;

		@XmlElement(name = "h")
		String h;

		private Pair() {}

		public Pair(int id, boolean entailment, String t, String h) {
			this();
			this.id = Integer.toString(id);
			this.entailment = entailment ? "YES" : "NO";
			this.t = t;
			this.h = h;
		}

		public String getId() {return id;}
		public String getEntailment() {return entailment;}
		public String getT() {return t;}
		public String getH() {return h;}
	}
	
	class NullStream extends OutputStream {
		@Override 
		public void write(int b) {}
	};

	private Corpus corpus;
	private Unmarshaller unmarshaller;
	private ExecutorService executor;

	public StanfordMultiThreadingTest() throws Exception {
		javax.xml.bind.JAXBContext jaxbCtx = JAXBContext.newInstance(Pair.class,Corpus.class);
		unmarshaller = jaxbCtx.createUnmarshaller();
		executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
	}

	public void readXML(String fileName) throws Exception {
		System.out.println("Reading XML - Started");
		corpus = (Corpus) unmarshaller.unmarshal(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8));
		System.out.println("Reading XML - Ended");
	}

	public void parseSentences() throws Exception {
		System.out.println("Parsing - Started");

		// turn pairs into a list of sentences
		List<String> sentences = new ArrayList<String>();
		for (Pair pair : corpus.getPairList()) {
			sentences.add(pair.getT());
			sentences.add(pair.getH());
		}

		// prepare the properties
		final Properties props = new Properties();
		props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

		// first run is long since models are loaded
		new StanfordCoreNLP(props);

		// to avoid the CoreNLP initialization prints (e.g. "Adding annotation pos")
		final PrintStream nullPrintStream = new PrintStream(new NullStream());
		PrintStream err = System.err;
		System.setErr(nullPrintStream);

		int totalCount = sentences.size();
		AtomicInteger counter = new AtomicInteger(0);

		// use java concurrency to parallelize the parsing
		for (String sentence : sentences) {
			executor.execute(new Runnable() {
				@Override
				public void run() {
					try {
						StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
						Annotation annotation = new Annotation(sentence);
						pipeline.annotate(annotation);
						if (counter.incrementAndGet() % 20 == 0) {
							System.out.println("Done: " + String.format("%.2f", counter.get()*100/(double)totalCount));
						};
					} catch (Exception e) {
						System.setErr(err);
						e.printStackTrace();
						System.setErr(nullPrintStream);
						executor.shutdownNow();
					}
				}
			});
		}
		executor.shutdown();
		
		System.out.println("Waiting for parsing to end.");		
		executor.awaitTermination(10, TimeUnit.MINUTES);

		System.out.println("Parsing - Ended");
	}

	public static void main(String[] args) throws Exception {
		StanfordMultiThreadingTest smtt = new StanfordMultiThreadingTest();
		smtt.readXML(args[0]);
		smtt.parseSentences();
	}

}

In my attempt to find some background information I encountered answers given by Christopher Manning and Gabor Angeli from Stanford which indicate that contemporary versions of Stanford CoreNLP should be thread-safe. However, a recent bug report on CoreNLP version 3.4.1 describes a concurrency problem. As mentioned in the title, I'm using version 3.5.2.

It's unclear to me whether the problem I'm facing is due to a bug or due to something wrong in the way I use the package. I'd appreciate it if someone more knowledgeable could shed some light on this. I hope that the sample code would be useful for reproducing the problem. Thanks!


Solution

Have you tried using the threads option? You can specify a number of threads for a single StanfordCoreNLP pipeline and then it will process sentences in parallel.

For example, if you want to process sentences on 8 cores, set the threads option to 8:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.put("threads", "8")
StanfordCoreNLP pipeline  = new StanfordCoreNLP(props);
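
Roughly, usage might look like the sketch below. This is only an illustration, not your exact setup: it joins the t/h texts into one document so that ssplit recovers them as separate sentences, and it leaves out dcoref, since coreference links across unrelated RTE pairs would not be meaningful.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ThreadsOptionSketch {
	public static void main(String[] args) {
		// Build the pipeline once; the models are loaded a single time.
		// dcoref is omitted because the texts batched here are unrelated pairs.
		Properties props = new Properties();
		props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
		props.put("threads", "8");
		StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

		// Stand-ins for the t/h texts collected from the corpus in the question.
		List<String> sentences = Arrays.asList(
				"John bought a new car.",
				"He paid for it in cash.");

		// Hand all texts to the pipeline as one document; ssplit separates the
		// individual sentences again, and the per-sentence work can then be
		// spread over the configured number of threads.
		Annotation document = new Annotation(String.join("\n", sentences));
		pipeline.annotate(document);

		List<CoreMap> annotated = document.get(CoreAnnotations.SentencesAnnotation.class);
		System.out.println("Annotated " + annotated.size() + " sentences");
	}
}

The annotators that work sentence by sentence, such as pos and parse, are the ones that can take advantage of the threads setting.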

Nevertheless I think your solution should also work and we'll check whether there is some concurrency bug, but using this option might solve your problem in the meantime.
