Unable to load OpenNLP sentence model in Hadoop map-reduce job


Problem Description

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:

public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;

    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);

    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }

    logger.info("number of sentences: " + sentences.length);

    <snip>
}

When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:

  • I've verified that the model file exists in the location sentenceModelPath refers to.
  • I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
  • Hadoop is just running on my local machine.

Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?

Solution

Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This tries to load the file from the classpath - neither a file in HDFS nor one on the client's local file system is part of the classpath at mapper / reducer runtime, which is why you're seeing the null error (getResourceAsStream() returns null when it can't find the resource, and the model then fails to load with "in must not be null!").
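
As a quick illustration (this null check is not part of the original code, it just makes the failure mode explicit), you can fail fast when the stream comes back empty:

    InputStream modelIn = getClass().getResourceAsStream(sentenceModelPath);
    if (modelIn == null) {
        // getResourceAsStream() searches only the classpath, so an absolute
        // filesystem path resolves to null rather than throwing an exception.
        throw new FileNotFoundException(
                "Sentence model not found on classpath: " + sentenceModelPath);
    }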

To get around this you have a number of options:

  • Amend your code to load the file from HDFS (a fuller mapper sketch follows this list):

    modelIn = FileSystem.get(context.getConfiguration()).open(
                     new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
    

  • Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local file system up to HDFS, and back down to the working directory of the running mapper / reducer); an example launch command and driver sketch follow this list:

    modelIn = new FileInputStream("en-sent.bin");
    

  • Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:

    modelIn = getClass().getResourceAsStream("/en-sent.bin");
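
For the HDFS route, here is a minimal mapper sketch putting the pieces together (the class name, key/value types, and output are illustrative assumptions, not taken from the original job). The model is loaded once per task in setup() rather than once per record:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceSplitMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private SentenceDetectorME sentenceBreaker;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Open the model directly from HDFS; the path is the one used above and
            // should be adjusted to wherever en-sent.bin actually lives.
            Path modelPath = new Path("/sandbox/corpus-analysis/nlp/en-sent.bin");
            InputStream modelIn = FileSystem.get(context.getConfiguration()).open(modelPath);
            try {
                sentenceBreaker = new SentenceDetectorME(new SentenceModel(modelIn));
            } finally {
                modelIn.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit one record per detected sentence.
            for (String sentence : sentenceBreaker.sentDetect(value.toString())) {
                context.write(new Text(sentence), NullWritable.get());
            }
        }
    }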
    
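The -files flag in the second option is consumed by GenericOptionsParser, which only runs when the job is launched through ToolRunner, and it must appear before the job's own arguments, e.g. hadoop jar yourjob.jar SentenceSplitDriver -files /local/path/to/en-sent.bin <input> <output> (the jar name, driver class, and paths here are placeholders). A minimal driver sketch for that setup might look like:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SentenceSplitDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // By the time run() is called, GenericOptionsParser has already stripped
            // -files and the other generic options, so args holds only <input> <output>.
            Job job = Job.getInstance(getConf(), "sentence-split");
            job.setJarByClass(SentenceSplitDriver.class);
            job.setMapperClass(SentenceSplitMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SentenceSplitDriver(), args));
        }
    }

With -files in place, the mapper would then open the model from its local working directory with new FileInputStream("en-sent.bin"), as shown in the second option above.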
