Unable to load OpenNLP sentence model in Hadoop map-reduce job
Question
I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;
    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);
    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }
    logger.info("number of sentences: " + sentences.length);
    <snip>
}
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
- I've verified that the model file exists in the location sentenceModelPath refers to.
- I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
- Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?
Solution
Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This tries to load the file from the classpath - neither a file in HDFS nor one on the client's local file system is part of the classpath at mapper/reducer runtime, which is why you're seeing the null error (getResourceAsStream() returns null when the resource cannot be found, and the SentenceModel constructor then rejects the null stream).
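The failure mode above can be reproduced without Hadoop or OpenNLP at all, since it is purely about how classpath resource lookup works. The sketch below is illustrative: ResourceLookupDemo and its helper are hypothetical names, the /sandbox path is the one from the answer's example, and /java/lang/Object.class merely stands in for something that genuinely is on the classpath.

```java
import java.io.InputStream;

public class ResourceLookupDemo {
    // Looks a name up on the classpath the same way the mapper code does.
    static boolean onClasspath(String name) {
        InputStream in = ResourceLookupDemo.class.getResourceAsStream(name);
        return in != null;
    }

    public static void main(String[] args) {
        // An absolute *file system* path is not a classpath resource, so the
        // lookup quietly returns null instead of throwing FileNotFoundException.
        System.out.println(onClasspath("/sandbox/corpus-analysis/nlp/en-sent.bin")); // false
        // A class file that really is on the classpath resolves fine when the
        // name is written with a leading slash.
        System.out.println(onClasspath("/java/lang/Object.class")); // true
    }
}
```

This is also why the catch (FileNotFoundException e) branch in the question never fires: no exception is thrown at lookup time, and the null only surfaces later inside the SentenceModel constructor.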
To get around this you have a number of options:
- Amend your code to load the file from HDFS:
  modelIn = FileSystem.get(context.getConfiguration()).open(new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
- Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and back down to the local directory of the running mapper/reducer):
  modelIn = new FileInputStream("en-sent.bin");
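With the -files route it is worth guarding the read, since a missing model (e.g. the option was forgotten on the command line) otherwise surfaces as a confusing downstream error. This is a minimal, Hadoop-free sketch of that guard; LocalModelLoad and openModel are hypothetical names, and the temp file in main merely stands in for the localized en-sent.bin.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;

public class LocalModelLoad {
    // Opens a model file from the task's working directory, failing loudly
    // instead of handing a bad stream to the SentenceModel constructor.
    static FileInputStream openModel(String name) throws IOException {
        File f = new File(name);
        if (!f.isFile()) {
            throw new FileNotFoundException(
                "Model " + name + " not found - was -files passed on the command line?");
        }
        return new FileInputStream(f);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the -files-localized en-sent.bin: any readable file
        // behaves the same way from the loader's point of view.
        File fake = File.createTempFile("en-sent", ".bin");
        fake.deleteOnExit();
        Files.write(fake.toPath(), new byte[] {1, 2, 3});
        try (FileInputStream in = openModel(fake.getAbsolutePath())) {
            System.out.println("opened, first byte: " + in.read()); // prints 1
        }
    }
}
```

The job would then be submitted with something along the lines of `hadoop jar job.jar MyDriver -files /local/path/to/en-sent.bin ...` (paths and driver name hypothetical), which requires the driver to use ToolRunner/GenericOptionsParser.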
- Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
  modelIn = getClass().getResourceAsStream("/en-sent.bin");