Hadoop: How can I merge reducer outputs to a single file?


Question




I know that the "getmerge" command in the shell can do this work.

But what should I do if I want to merge these outputs after the job using the HDFS API for Java?

What I actually want is a single merged file on HDFS.

The only thing I can think of is to start an additional job after that.

Thanks!

Solution

But what should I do if I want to merge these outputs after the job by HDFS API for java?

Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments -- FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination to obtain an OutputStream.
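As a minimal, untested sketch of that idea: the paths below are hypothetical, and note that FileUtil.copyMerge exists in Hadoop 1.x/2.x but was removed in Hadoop 3.x, so this only applies to the older API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical paths -- replace with your job's actual output directory.
        Path srcDir = new Path("/user/me/job-output");
        Path dstFile = new Path("/user/me/merged-output.txt");

        // Both source and destination can live on HDFS: resolve each
        // FileSystem from its own Path instead of FileSystem.getLocal,
        // which is what FsShell does for -getmerge.
        FileSystem srcFs = srcDir.getFileSystem(conf);
        FileSystem dstFs = dstFile.getFileSystem(conf);

        // copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
        // deleteSource=false keeps the part files; addString=null adds no
        // separator between the concatenated files.
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
    }
}
```

This still streams every part file through the client JVM, which is the limitation described below.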

That said, I don't think it wins you very much -- the merge is still happening in the local JVM; so you aren't really saving very much over -getmerge followed by -put.
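The -getmerge-then--put alternative mentioned above would look something like this (the HDFS paths are hypothetical):

```shell
# Concatenate all part files from the job's HDFS output into one local file...
hadoop fs -getmerge /user/me/job-output merged-output.txt
# ...then copy the merged file back up to HDFS.
hadoop fs -put merged-output.txt /user/me/merged-output.txt
```

Either way the data makes a round trip through the machine running the command, so neither approach merges the outputs "in place" on the cluster.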
