How can I concatenate two files in hadoop into one using Hadoop FS shell?


Problem description


I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)

Here is the command I'm submitting (names have been changed):

/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/

It returns bash: /user/username/folder/outputdirectory/: No such file or directory

I also tried creating that directory and then running it again -- I still got the 'no such file or directory' error.

I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them, but had no luck with getmerge either.

The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.

Solution

The error occurs because you are trying to redirect the standard output of the command back to HDFS: the shell's > redirection is resolved against the local file system, not HDFS, which is why bash reports the path as missing. You can achieve the same result with the hadoop fs -put command, using a hyphen as the source argument so it reads from standard input:

bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv

-getmerge also outputs to the local file system, not HDFS.
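If you do want the getmerge route anyway, the round trip looks roughly like the sketch below, assuming the two CSVs are the only files in the hypothetical source folder (getmerge concatenates every file under the given directory):

# getmerge writes ONE local file containing every file under the HDFS directory
bin/hadoop fs -getmerge /user/username/folder /tmp/merged.csv
# push the merged result back into HDFS, then drop the local copy
bin/hadoop fs -put /tmp/merged.csv /user/username/folder/output.csv
rm /tmp/merged.csv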

Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that via:

  • a custom MapReduce job with a single reducer and a custom mapper/reducer pair that retains the file ordering (remember each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself)
  • the FsShell commands, depending on your network topology - i.e. does your client console have a fast connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job doing the same (as everything has to go through one machine anyway, so why not your local console?); see the sanity-check sketch after this list
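Whichever route you take, a quick sanity check before removing the originals costs little. The sketch below reuses the hypothetical paths from the question and only standard FsShell commands:

# output.csv should be the size of csv1.csv plus csv2.csv
hadoop fs -du /user/username/folder
# spot-check the start of the merged file
hadoop fs -cat /user/username/folder/output.csv | head
# once satisfied, remove the original pieces
hadoop fs -rm /user/username/folder/csv1.csv /user/username/folder/csv2.csv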
