Merging small files in hadoop

Problem description

I have a directory (Final Dir) in HDFS into which some small files (e.g. 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (e.g. 100 MB). But the user keeps pushing files to Final Dir; it is a continuous process.

So for the first pass I need to combine the first 10 files into a large file (e.g. large.txt) and save it to Final Dir.

Now my question is: how do I get the next 10 files, excluding the first 10 that have already been merged?

Can someone please help me?

Recommended answer

Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: use your input folder as a buffer that receives the small files, move them to a tmp directory at regular intervals, merge them there, and push the result back to input.

Step 1: create a tmp directory

hadoop fs -mkdir tmp
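
Relative HDFS paths such as tmp resolve against the user's HDFS home directory, so for the example user abc assumed throughout this answer the command above is equivalent to the absolute form used later in the streaming job:

hadoop fs -mkdir -p /user/abc/tmp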

Step 2: move all the small files present at that point in time to the tmp directory

hadoop fs -mv input/*.txt tmp

Step 3: merge the small files with the help of the hadoop-streaming jar

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "/user/abc/tmp" \
                   -output "/user/abc/output" \
                   -mapper cat \
                   -reducer cat

Step 4: move the output back to the input folder

hadoop fs -mv output/part-00000 input/large_file.txt

Step 5: remove the output directory

 hadoop fs -rm -R output/
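
For small data volumes, steps 3 to 5 can also be done without launching a MapReduce job at all. The following is only a sketch and not part of the original answer; it assumes the merged result fits on the local disk of the machine running the commands:

hadoop fs -getmerge tmp /tmp/large_file.txt
hadoop fs -put /tmp/large_file.txt input/large_file.txt
rm /tmp/large_file.txt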

Step 6: remove all the files from tmp

hadoop fs -rm tmp/*.txt

Create a shell script covering steps 2 to 6 and schedule it to run at regular intervals, so that the smaller files are merged periodically (maybe every minute, depending on your need).

Steps to schedule a cron job for merging the small files

Step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6)

Important note: you need to specify the absolute path of hadoop in the script so that cron can find it.

#!/bin/bash
# move the current batch of small files (and any previously merged large_file.txt) from input to the tmp buffer
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
# merge everything sitting in the tmp buffer into a single output file
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "/user/abc/tmp" \
                   -output "/user/abc/output" \
                   -mapper cat \
                   -reducer cat
wait
# push the merged result back into the input folder
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
# clean up the job output directory and the merged originals
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
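
Before wiring the script into cron, it is worth running it once by hand to confirm the cycle works end to end (the paths simply follow the example above):

bash /home/abc/mergejob.sh
/home/abc/hadoop-2.6.0/bin/hadoop fs -ls input

After a successful run, input should contain large_file.txt (plus any new small files that arrived while the job was running) and tmp should be empty.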

Step 2: schedule the script with cron to run every minute

a) edit the crontab (choosing an editor if prompted)

crontab -e

b) add the following line at the end and exit the editor

* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
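
The installed entry can be checked with:

crontab -l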

The merge job will now run every minute.

Hope this helps.
