Merging HDFS files

Problem Description

I have 1000+ files available in HDFS with a naming convention of 1_fileName.txt to N_fileName.txt. Each file is 1024 MB in size. I need to merge these files into one file (in HDFS) while preserving their order: 5_FileName.txt should be appended only after 4_fileName.txt.

What is the best and fastest way to perform this operation?

Is there any way to perform this merge without copying the actual data between data nodes? For example: get the block locations of these files and create a new entry (file name) in the NameNode that points at those block locations?

Answer

There is no efficient way of doing this; you'll need to move all the data to one node, then back to HDFS.
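
One concrete way to do that round trip (a sketch, not from the original answer; /input/dir is a hypothetical directory holding the parts) is hadoop fs -getmerge, which concatenates every file under a directory into one local file that you then upload again:

hadoop fs -getmerge /input/dir merged_local.txt     # pulls all parts (~1 TB here) to local disk
hadoop fs -put merged_local.txt targetFilename.txt  # uploads the merged file back to HDFS

getmerge concatenates in lexicographic name order and needs local disk for the whole data set, so streaming the data instead avoids the intermediate copy.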

A command line scriptlet to do this could be as follows:

hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt

This will cat all files that match the glob to standard output; that stream is then piped to the put command, which writes it to an HDFS file named targetFilename.txt.
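
As an aside (not in the original answer): -text is useful because it decodes known formats on the fly, but since these inputs are plain .txt files, -cat does the same job with raw bytes:

hadoop fs -cat *_fileName.txt | hadoop fs -put - targetFilename.txt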

The only problem is the filename structure you have gone for. If the numeric part were fixed-width and zero-padded this would be easier, but in its current state the glob expands in an unexpected lexicographic order (1, 10, 100, 1000, 11, 110, etc.) rather than numeric order (1, 2, 3, 4, etc.). You can work around this by amending the scriptlet to:

hadoop fs -text [0-9]_fileName.txt [0-9][0-9]_fileName.txt \
    [0-9][0-9][0-9]_fileName.txt [0-9][0-9][0-9][0-9]_fileName.txt \
    | hadoop fs -put - targetFilename.txt
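
If the glob juggling gets awkward (gaps in the numbering, or more digits), a minimal sketch of an alternative, assuming the files really are numbered 1 through 1000 and sit in the current HDFS working directory, is to generate the names in numeric order yourself and hand them all to a single -text invocation:

# seq emits 1..1000 in numeric order; sed turns each number into a file name.
# All names go to one hadoop invocation, so only one JVM is started.
hadoop fs -text $(seq 1 1000 | sed 's/$/_fileName.txt/') | hadoop fs -put - targetFilename.txt

Adjust the seq bounds to the real file count; a missing number will make the command fail rather than silently skip a part, which is arguably what you want for an ordered merge.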
