Merging small files in Hadoop

Question

I have a directory (Final Dir) in HDFS into which small files (ex: 10 MB each) are loaded every minute. After some time I want to combine all those small files into one large file (ex: 100 MB). But the user keeps pushing files to Final Dir; it is a continuous process.

So for the first run I need to combine the first 10 files into a large file (ex: large.txt) and save it back to Final Dir.

Now my question is: how will I get the next 10 files, excluding the first 10?

Can you please help me?

Answer

Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: use your input folder as a buffer to receive the small files, move them to a tmp directory at regular intervals, merge them there, and push the result back into input.

step 1: create a tmp directory

hadoop fs -mkdir tmp

step 2: at a given point in time, move all the small files to the tmp directory

hadoop fs -mv input/*.txt tmp

step 3: merge the small files with the help of the hadoop-streaming jar

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "/user/abc/input" \
                   -output "/user/abc/output" \
                   -mapper cat \
                   -reducer cat
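
Optionally, before moving the result you can sanity-check that the streaming job wrote a single merged part file. This is just a quick verification sketch, assuming the same /user/abc/output path used above:

hadoop fs -ls /user/abc/output
hadoop fs -cat /user/abc/output/part-00000 | head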

step 4: move the output to the input folder

hadoop fs -mv output/part-00000 input/large_file.txt

step 5: remove the output directory

 hadoop fs -rm -R output/

step 6: remove all the files from tmp

hadoop fs -rm tmp/*.txt

Create a shell script covering steps 2 through 6 and schedule it to run at regular intervals so the smaller files are merged on a fixed schedule (maybe every minute, based on your need).

Steps to schedule a cron job for merging the small files

step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6)

Important note: you need to specify the absolute path of hadoop in the script so that cron can find it.

#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "/user/abc/input" \
                   -output "/user/abc/output" \
                   -mapper cat \
                   -reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
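
Before handing the script over to cron, it may help to make it executable and run it once by hand as a sanity check (an extra step not in the original answer; the path assumes the script was saved at /home/abc/mergejob.sh as above):

chmod +x /home/abc/mergejob.sh
bash /home/abc/mergejob.sh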

step 2: schedule the script with cron to run every minute using a cron expression

a) edit the crontab by choosing an editor

crontab -e

b) add the following line at the end and exit the editor

* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
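
As an optional, hypothetical variation while testing (not part of the original answer), you can append the script's output to a log file instead of discarding it, and confirm the entry is installed with crontab -l. The log path /home/abc/mergejob.log is just an example:

* * * * * /bin/bash /home/abc/mergejob.sh >> /home/abc/mergejob.log 2>&1
crontab -l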

The merge job will now be scheduled to run every minute.

Hope this helps.
