Spark dataframe write method writing many small files


Problem Description

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB-128MB files; our block size is 128MB), which is approximately 12,000 files.

The job is as follows:

 val events = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect{ case List(date, y, "Event") => MyEvent(date, y, "Event") } // keep only "Event" records
  .toDF()                                                              // requires import spark.implicits._

events.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")

It collects the events with a common schema, converts them to a DataFrame, and then writes out as parquet.

The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.

Ideally I want to create only a handful of parquet files within the 'date' partition.

What would be the best way to control this? Is it by using coalesce()?

How will that affect the number of files created in a given partition? Does it depend on how many executors I have working in Spark? (Currently set at 100.)
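For context, a minimal sketch of the coalesce() option mentioned above (not part of the original post): coalesce(n) shrinks the DataFrame to n partitions before the write, so each 'date' directory receives at most n part files, and the executor count on its own does not change that. The value 10 below is an arbitrary illustration, not a recommendation.

// Illustration only: with 10 partitions, each date directory gets at most 10 files.
events
  .coalesce(10)
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")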

Recommended Answer

You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.

Try this:

df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
