How to control the number of output part files created by a Spark job upon writing?

Problem description

I have a couple of Spark jobs that process thousands of files every day. File sizes may vary from MBs to GBs. After a job finishes, I usually save the output using the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
// or
dataFrame.write.format("orc").save("/path/in/hdfs") // storing as an ORC file as of Spark 1.4

The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates a part file for each partition/task (please correct me if I am wrong). How do we control the number of part files Spark creates?

Finally, I would like to create a Hive table from these parquet/orc directories, and I have heard that Hive is slow when there is a large number of small files.

Recommended answer

You may want to try using the DataFrame.coalesce method to decrease the number of partitions; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion).
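As a rough sketch (assuming the dataFrame from the question and Spark 1.4+; the output path and the target of 10 partitions are only illustrative), coalescing before the write might look like this:

// Reduce to ~10 partitions so the write produces roughly 10 part files (illustrative count)
val coalesced = dataFrame.coalesce(10)
coalesced.write.format("orc").save("/path/in/hdfs")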

To increase or decrease the number of partitions you can use the DataFrame.repartition function. Note that coalesce does not cause a shuffle, while repartition does, so coalesce is usually the cheaper option when you only need fewer output files.
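If you need more partitions than you currently have (coalesce can only reduce the count), a repartition sketch, again with an illustrative partition count, could look like this:

// repartition performs a full shuffle but can either increase or decrease the partition count
val repartitioned = dataFrame.repartition(200)
repartitioned.write.format("orc").save("/path/in/hdfs")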
