Import external libraries in a Hadoop MapReduce script


Question


I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main script, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step (see the sketch after the questions below).

  • How can I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
  • Is it possible to access S3 like that from inside the Hadoop environment?
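
To make the goal concrete, here is a minimal sketch of the kind of streaming reducer described above, assuming a zipped copy of the pure-Python boto library is shipped alongside the job; the archive name, input format, and bucket-naming scheme are illustrative assumptions, not part of the original question:

#!/usr/bin/env python
# Hypothetical streaming reducer. Assumes 'boto.zip' (a zipped copy of the
# pure-Python boto library) was shipped with the job so that it lands in the
# task's working directory.
import sys
sys.path.insert(0, 'boto.zip')  # zipimport makes the shipped archive importable

import boto  # resolved from boto.zip

def main():
    conn = boto.connect_s3()  # assumes AWS credentials are available to the task
    for line in sys.stdin:
        # Assumed input format: item <TAB> list of similar items
        item, similar = line.rstrip('\n').split('\t', 1)
        bucket = conn.create_bucket('similar-items-%s' % item.lower())  # hypothetical naming
        bucket.new_key('similar_items').set_contents_from_string(similar)

if __name__ == '__main__':
    main()

Since boto is pure Python, putting the zipped archive on sys.path works without any compilation step; this is one common way to get third-party libraries into streaming tasks.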

Thanks in advance, Thomas

Answer


When launching a Hadoop process you can specify external files that should be made available. This is done by using the -files argument.

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat


I don't know if the files HAVE to be on the HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code, you can do something similar to the following:

// Check whether any files were shipped to this task via the distributed cache
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        // Look for the file that was shipped with -files, by name
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // The cached copy lives on the local filesystem of the task node
            File file = new File(localFile.toUri().getPath());
        }
    }
}


This is all but copied and pasted directly from working code inside several of our Mappers.
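
For a Python streaming job, the equivalent of the cache lookup above is simpler: files shipped with -files are symlinked into the task's current working directory, so the script can open them by name. A minimal sketch, reusing the GeoIPCity.dat filename from the example command:

import os

# Files shipped with -files are symlinked into the task's working directory,
# so they can be opened by their plain name.
if os.path.exists('GeoIPCity.dat'):
    with open('GeoIPCity.dat', 'rb') as f:
        geo_data = f.read()  # use the cached file as needed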


I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)


In addition to -files there is -libjars for including additional jars; I have a little information about it here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar? (http://stackoverflow.com/questions/4959762/if-i-have-a-constructor-that-requires-a-path-to-a-file-how-can-i-fake-that-if/4962508#4962508)
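
For example, a hypothetical invocation (the jar names and paths are placeholders, not from the original answer):

$HADOOP_HOME/bin/hadoop jar MyJar.jar -libjars /path/to/extra-lib1.jar,/path/to/extra-lib2.jar

-libjars takes a comma-separated list of jars and makes them available on the classpath of the map and reduce tasks.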
