Spark dataframe write method writing many small files


Problem Description

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB-128MB files; our block size is 128MB), which is approximately 12,000 files.

The job is as follows:

 val events = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect{ case List(date, y, "Event") => MyEvent(date, y, "Event") } // keep only "Event" records
  .toDF()                                                              // requires import spark.implicits._

events.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")

It collects the events with a common schema, converts them to a DataFrame, and then writes out as parquet.

The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.

Ideally I want to create only a handful of parquet files within the 'date' partition.

What would be the best way to control this? Is it by using coalesce()?

How will that affect the number of files created in a given partition? Does it depend on how many executors I have working in Spark? (Currently set at 100.)
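For context, a minimal sketch of the coalesce() option mentioned above (not part of the original post): coalesce(n) shrinks the DataFrame to n partitions before the write, so each 'date' directory receives at most n part files, and the executor count on its own does not change that. The value 10 below is an arbitrary illustration, not a recommendation.

// Illustration only: with 10 partitions, each date directory gets at most 10 files.
events
  .coalesce(10)
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")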

Recommended Answer

You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.

Try this:

df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
