Spark dataframe write method writing many small files

Problem description

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approximately 12 thousand files.

The job works like this:

// Parse the raw log lines into MyEvent records and convert them to a DataFrame
// (requires `import spark.implicits._` for .toDF() on an RDD)
val df = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect { case List(date, y, "Event") => MyEvent(date, y, "Event") }
  .toDF()

// Write the events out as parquet, appended and partitioned by date
df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")

It collects the events with a common schema, converts them to a DataFrame, and then writes the result out as parquet.

The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.

Ideally I want to create only a handful of parquet files within the partition 'date'.

What would be the best way to control this? Is it by using 'coalesce()'?

How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).

Recommended answer

You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.

Try this:

df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
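
If one file per date turns out to be too large, a further refinement (not part of the original answer) is to also cap the size of each output file. The sketch below assumes Spark 2.2 or later, where the DataFrameWriter accepts a maxRecordsPerFile option; the limit of 1,000,000 records is purely illustrative.

// Variation on the answer above: still one shuffle partition per date,
// but any per-date file exceeding the threshold is split automatically.
df
  .repartition($"date")
  .write
  .mode(SaveMode.Append)
  .option("maxRecordsPerFile", 1000000) // Spark 2.2+ only; 1,000,000 is an illustrative limit
  .partitionBy("date")
  .parquet(s"$path")

By contrast, coalesce(n) on its own only caps the number of write tasks at n; because partitionBy still splits each task's output by date, every date directory can still receive up to n files.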
