How can I control the number of output files written from Spark DataFrame?


Problem description

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame, and later I wish to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save(path)

yields many files; some are large, and some are even 0 bytes.
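
For context, the setup described above corresponds roughly to the sketch below. It assumes Structured Streaming with foreachBatch; the broker address, topic name, and HDFS paths are placeholders, not values from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-json-to-hdfs").getOrCreate()

# Read raw Kafka records; the JSON payload arrives in the binary 'value' column.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "json-topic")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

def write_batch(batch_df, batch_id):
    # Each micro-batch is written as text files: one file per partition,
    # which is where the many (and sometimes empty) files come from.
    batch_df.write.mode("append").format("text").save("hdfs:///data/out")

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "hdfs:///data/checkpoints")
         .start())
query.awaitTermination()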

Is there a way to control the number of output files? Also, to avoid the opposite problem, is there a way to limit the size of each file so that a new file is written once the current one reaches a certain size or number of rows?

Recommended answer

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets with no wide dependencies, you can control the input partitioning with reader-specific parameters.
  • For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
  • Independent of the lineage, you can coalesce or repartition (see the sketch after this list).
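
As a sketch of the last two points, the snippet below is a minimal illustration; the toy DataFrame, partition counts, and output paths are assumptions, not part of the original question or answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("control-output-files").getOrCreate()

# For Datasets with wide dependencies (joins, aggregations), the number of
# post-shuffle partitions, and hence output files, follows this setting.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# A toy single-column string DataFrame so it can be written with the text format.
df = spark.range(0, 1000).selectExpr("CAST(id AS STRING) AS value")

# coalesce(n) merges existing partitions without a full shuffle,
# which is the cheap way to reduce the number of output files.
df.coalesce(1).write.mode("append").format("text").save("/tmp/out_coalesced")

# repartition(n) performs a full shuffle; it can also increase the
# partition count or rebalance skewed partitions before writing.
df.repartition(4).write.mode("append").format("text").save("/tmp/out_repartitioned")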

Is there a way to limit the size of each file so that a new file is written once the current one reaches a certain size or number of rows?

No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
