How to use a MapReduce output in Distributed Cache
Question
Let's say I have a MapReduce job which creates an output file part-00000, and there is another job that runs after this job completes.
How can I use the output file of the first job in the Distributed Cache for the second job?
Answer
The steps below might help you:
- Pass the first job's output directory path to the second job's driver class.
- Use a PathFilter to list the files whose names start with part-. Refer to the code snippet below for your second job's driver class:
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
    new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return path.getName().startsWith("part-");
        }
    });
- Iterate over every part- file and add it to the distributed cache.
for (int i = 0; i < fileList.length; i++) {
    // Path.toUri() already returns a URI; wrapping it in "new URI(...)"
    // (as in the original snippet) does not compile, since java.net.URI
    // has no constructor taking a URI.
    DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
}
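To see the listing-and-filtering step work without a Hadoop cluster, the same idea can be sketched against the local filesystem with plain Java. This is only an illustration of the part- filter logic above, not the Hadoop API: the class and method names (PartFileLister, listPartFiles, isPartFile) are made up for this example, and on a real cluster you would use FileSystem.listStatus with a PathFilter and DistributedCache.addCacheFile as shown in the snippets.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Local-filesystem analogue of the driver-side listing step:
// collect the URIs of all part-* files in a job output directory.
public class PartFileLister {

    // Mirrors the PathFilter above: accept only names starting with "part-".
    static boolean isPartFile(String name) {
        return name.startsWith("part-");
    }

    // Returns the URIs of every part-* file in dir; on a real cluster these
    // are the URIs you would pass to DistributedCache.addCacheFile.
    static List<URI> listPartFiles(File dir) {
        List<URI> uris = new ArrayList<>();
        File[] files = dir.listFiles((d, name) -> isPartFile(name));
        if (files != null) {
            for (File f : files) {
                uris.add(f.toURI());
            }
        }
        return uris;
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway directory imitating a job output dir.
        File out = java.nio.file.Files.createTempDirectory("job-output").toFile();
        new File(out, "part-00000").createNewFile();
        new File(out, "part-00001").createNewFile();
        new File(out, "_SUCCESS").createNewFile();

        // Only the two part-* files are picked up; _SUCCESS is skipped.
        System.out.println(listPartFiles(out).size() + " part files found");
    }
}
```

Note that a MapReduce output directory also contains marker files such as _SUCCESS, which is exactly why the prefix filter matters before adding files to the cache.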