Merging small files in Hadoop
Problem description
I have a directory (Final Dir) in HDFS into which small files (ex: 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (ex: 100 MB). But the user is continuously pushing files to Final Dir; it is a continuous process.
So the first time, I need to combine the first 10 files into a large file (ex: large.txt) and save that file to Finaldir.
Now my question is: how will I get the next 10 files, excluding the first 10?
Can you help me?
Answer
Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: treat your input folder as a buffer that receives the small files, push them to a tmp directory at regular intervals, merge them, and push the result back to input.
step 1: create a tmp directory
hadoop fs -mkdir tmp
step 2: move all the small files to the tmp directory at a point in time
hadoop fs -mv input/*.txt tmp
step 3: merge the small files (which now sit in tmp) with the help of the hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/tmp" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
step 4: move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
step 5: remove the output directory
hadoop fs -rm -R output/
step 6: remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script covering steps 2 through 6 and schedule it to run at regular intervals to merge the smaller files (maybe every minute, based on your need).
Steps to schedule a cron job for merging small files
step 1: create a shell script /home/abc/mergejob.sh with the help of the steps above (2 to 6)
Important note: you need to specify the absolute path of hadoop in the script so that cron can resolve it.
#!/bin/bash
# step 2: move the small files from input to the tmp buffer
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
# step 3: merge them with a single-reducer identity (cat/cat) streaming job,
# reading from tmp where step 2 just put the files
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/tmp" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
# step 4: move the merged result back into the input folder
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
# step 5: remove the job output directory
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
# step 6: clear the tmp buffer
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
step 2: schedule the script with cron to run every minute, using a cron expression
a) edit the crontab by choosing an editor
crontab -e
b) add the following line at the end and exit the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job will now be scheduled to run every minute.
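The buffer-and-merge cycle the script implements can be sketched locally with plain shell, using ordinary directories in place of HDFS. The demo/ paths and file names below are illustrative, and plain cat stands in for the cat/cat streaming job:

```shell
#!/bin/sh
set -e
# Local stand-ins for the HDFS input, tmp, and output directories.
rm -rf demo
mkdir -p demo/input demo/tmp demo/output

# Simulate two small files arriving in the input buffer.
printf 'line from a\n' > demo/input/a.txt
printf 'line from b\n' > demo/input/b.txt

# step 2: move the small files to the tmp buffer
mv demo/input/*.txt demo/tmp/

# step 3: merge them (locally, cat replaces the streaming job)
cat demo/tmp/*.txt > demo/output/part-00000

# step 4: move the merged result back into the input folder
mv demo/output/part-00000 demo/input/large_file.txt

# steps 5 and 6: clean up the job output and the tmp buffer
rm -f demo/tmp/*.txt
```

On the next cycle, large_file.txt is itself moved to tmp (it matches input/*.txt) and re-merged together with any newly arrived small files, which is how the single large file keeps growing over time.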
Hope this helps.