How can I control the number of output files written from Spark DataFrame?


Problem description

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame, and later I wish to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save(path)

yields many files; some are large, and some are even 0 bytes.
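
For context, the setup described above corresponds roughly to the sketch below. It assumes Structured Streaming with foreachBatch; the broker address, topic name, and HDFS paths are placeholders, not values from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-json-to-hdfs").getOrCreate()

# Read raw Kafka records; the JSON payload arrives in the binary 'value' column.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "json-topic")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

def write_batch(batch_df, batch_id):
    # Each micro-batch is written as text files: one file per partition,
    # which is where the many (and sometimes empty) files come from.
    batch_df.write.mode("append").format("text").save("hdfs:///data/out")

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "hdfs:///data/checkpoints")
         .start())
query.awaitTermination()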

Is there a way to control the number of output files? Also, to avoid the opposite problem, is there a way to limit the size of each file so that a new file is written once the current one reaches a certain size or number of rows?

Recommended answer

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets with no wide dependencies, you can control the input partitioning with reader-specific parameters.
  • For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
  • Independent of the lineage, you can coalesce or repartition (see the sketch after this list).
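
As a sketch of the last two points, the snippet below is a minimal illustration; the toy DataFrame, partition counts, and output paths are assumptions, not part of the original question or answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("control-output-files").getOrCreate()

# For Datasets with wide dependencies (joins, aggregations), the number of
# post-shuffle partitions, and hence output files, follows this setting.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# A toy single-column string DataFrame so it can be written with the text format.
df = spark.range(0, 1000).selectExpr("CAST(id AS STRING) AS value")

# coalesce(n) merges existing partitions without a full shuffle,
# which is the cheap way to reduce the number of output files.
df.coalesce(1).write.mode("append").format("text").save("/tmp/out_coalesced")

# repartition(n) performs a full shuffle; it can also increase the
# partition count or rebalance skewed partitions before writing.
df.repartition(4).write.mode("append").format("text").save("/tmp/out_repartitioned")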

Is there a way to limit the size of each file so that a new file is written once the current one reaches a certain size or number of rows?

No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
