Merging small files in hadoop
Problem description
I have a directory (Final Dir) in HDFS into which some files (ex: 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (ex: 100 MB). But the user keeps pushing files to Final Dir; it is a continuous process.
So for the first time I need to combine the first 10 files into a large file (ex: large.txt) and save that file to Final Dir.
Now my question is: how will I get the next 10 files, excluding the first 10 files?
Can someone help me with this?
Recommended answer
Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: use your input folder as a buffer that receives the small files, push them to a tmp directory at regular points in time, merge them, and push the result back to input.
step 1: create a tmp directory
hadoop fs -mkdir tmp
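If the job can run before tmp exists, a hedged variant (the absolute path /user/abc/tmp is an assumption matching the paths used below) creates it idempotently:

hadoop fs -mkdir -p /user/abc/tmp   # -p: no error if the directory already exists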
step 2: move all the small files to the tmp directory at a given point in time
hadoop fs -mv input/*.txt tmp
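Note that hadoop fs -mv fails when the glob matches nothing. Inside the script, a hedged guard (a sketch, not part of the original answer) can skip the cycle in that case:

hadoop fs -ls input/*.txt > /dev/null 2>&1 || exit 0   # no small files waiting, nothing to merge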
step 3: merge the small files with the help of the hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/user/abc/tmp" \
    -output "/user/abc/output" \
    -mapper cat \
    -reducer cat
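If spinning up a MapReduce job just to concatenate files feels heavy, a hedged alternative to steps 3 to 5 streams the concatenation through the client with plain fs commands (a sketch; all data flows through one machine, so it suits modest volumes):

hadoop fs -cat tmp/*.txt | hadoop fs -put - input/large_file.txt   # '-' makes put read from stdin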
step 4: move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
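As written, each cycle sweeps the previous large_file.txt back into tmp in step 2 (it matches input/*.txt) and re-merges it, so the single file keeps growing. If you would rather keep one merged file per batch, a hypothetical variant timestamps the name; note such files still match the step 2 glob and would be re-merged unless you also tighten that glob:

hadoop fs -mv output/part-00000 "input/large_$(date +%Y%m%d%H%M%S).txt"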
step 5: remove the output directory
hadoop fs -rm -R output/
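If HDFS trash is enabled, the removed output lingers in .Trash and keeps consuming quota; a hedged variant frees the space immediately:

hadoop fs -rm -R -skipTrash output/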
step 6: remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script from step 2 through step 6 and schedule it to run at regular intervals so the smaller files are merged regularly (maybe every minute, based on your need).
Steps to schedule a cron job for merging the small files
step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6)
Important note: you need to specify the absolute path of hadoop in the script for it to be understood by cron.
#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/user/abc/tmp" \
    -output "/user/abc/output" \
    -mapper cat \
    -reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
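The wait calls are harmless no-ops here, since nothing runs in the background. More useful is aborting on the first failure so that a failed merge never reaches the tmp cleanup; a hedged refinement (an assumption, not in the original answer) adds this right after the shebang:

set -e   # stop at the first failed command so tmp/*.txt is never removed after a failed merge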
step 2: schedule the script to run every minute using a cron expression
a) edit the crontab by choosing an editor
crontab -e
b) add the following line at the end and exit the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job will be scheduled to run every minute.
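A streaming job can easily take longer than a minute, so back-to-back cron runs may overlap and fight over the same directories. A hedged variant of the crontab entry (the flock path is an assumption) serializes the runs:

* * * * * /usr/bin/flock -n /tmp/mergejob.lock /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1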
Hope this helps.