Spark dataframe write method writing many small files
Problem description
I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB-128MB files - our block size is 128MB), which is approximately 12 thousand files.
The job works like this:
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect { case List(date, y, "Event") => MyEvent(date, y, "Event") }
  .toDF()

df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts them to a DataFrame, and then writes them out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within each 'date' partition.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that affect the number of files created in a given partition? Does it depend on how many executors I have working in Spark? (currently set at 100).
Recommended answer
You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.
Try this:
df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
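This works because `repartition($"date")` shuffles every row with the same date into the same in-memory partition, so each output `date=` directory is written by a single task and ends up with a single file per write (instead of one file per upstream task, as in the original job). As a rough sketch of a variant - the two-argument `repartition(numPartitions, cols*)` overload also caps the total number of shuffle tasks; the value 10 below is an arbitrary example, not something from the question:

```scala
// Sketch only: assumes the same `df` and `path` as above, a SparkSession in
// scope as `spark`, and `import spark.implicits._` for the $"..." syntax.
import org.apache.spark.sql.SaveMode

// Hash rows by date into (at most) 10 shuffle partitions. Each date still
// hashes wholly into one partition, so each date= directory still gets one
// file, but the write runs with a bounded number of tasks.
df.repartition(10, $"date")
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")
```

By contrast, `coalesce(n)` (from the question) only merges existing partitions without a shuffle, so rows for one date can remain spread across several tasks and you can still get multiple files per date directory.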