How to merge small HDFS files into one large file?
Question
I have a number of small files generated from a Kafka stream, and I would like to merge them into a single file. The merge should be based on date: the source folder may contain many files from earlier dates, but I only want to merge the files for a given date into one file.
Any suggestions?
Recommended answer
Use something like the code below to iterate over the smaller files and append them to a single target (assuming that source contains the HDFS path to your smaller files, and target is the path where you want the combined result):
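import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Append the contents of each small file to target; Spark writes target as a directory of part files
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath).
  foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))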
This example assumes the text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
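The question asks for only one date's files to be merged. Here is a minimal sketch of one way to do that, assuming "date" means each file's HDFS modification time (if the date is encoded in the file names instead, filter on the name). MergeByDate, modifiedOn and mergeDay are hypothetical names introduced for illustration. The sketch reads all matching files in a single pass, so that coalesce(1) yields one part file under target.

```scala
import java.time.{Instant, LocalDate, ZoneId}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object MergeByDate {
  // True when a modification time (epoch millis) falls on the given calendar day in the given zone.
  def modifiedOn(epochMillis: Long, day: LocalDate, zone: ZoneId): Boolean =
    Instant.ofEpochMilli(epochMillis).atZone(zone).toLocalDate == day

  // Hypothetical helper: source, target, day and zone are all supplied by the caller.
  def mergeDay(spark: SparkSession, source: String, target: String,
               day: LocalDate, zone: ZoneId): Unit = {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Keep only regular files whose modification time falls on the requested day.
    val paths = fs.listStatus(new Path(source))
      .filter(s => s.isFile && modifiedOn(s.getModificationTime, day, zone))
      .map(_.getPath.toString)
    if (paths.nonEmpty) {
      // One read over all matching files, one partition, hence one part file under target.
      spark.read.text(paths: _*).coalesce(1).write.text(target)
    }
  }
}
```

A call might look like MergeByDate.mergeDay(spark, source, target, LocalDate.of(2020, 1, 15), ZoneId.of("UTC")), where the date and zone are placeholders. With the default save mode the write fails if target already exists, so a per-date target path is a natural fit. Note that target is still a directory containing the single part file.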