How can I control the number of output files written from Spark DataFrame?
Problem Description
Using Spark Streaming to read JSON data from a Kafka topic. I use a DataFrame to process the data, and later I want to save the output to HDFS files. The problem is that using:
df.write.mode("append").format("text").save(path)
yields many files; some are large, and some are even 0 bytes.
Is there a way to control the number of output files? Also, to avoid the "opposite" problem, is there a way to limit the size of each file, so that a new file is written once the current one reaches a certain size/number of rows?
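For context, here is a minimal sketch of this kind of pipeline using Structured Streaming; the broker address, topic name, and HDFS paths are placeholder assumptions, not details from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw records from Kafka (server and topic are hypothetical).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers bytes; the text sink needs a single string column.
lines = raw.select(col("value").cast("string"))

# Append each micro-batch to HDFS as text files; every partition of a
# batch becomes its own file, which is where the many-files problem starts.
query = (lines.writeStream
         .format("text")
         .option("path", "hdfs:///tmp/output")
         .option("checkpointLocation", "hdfs:///tmp/checkpoint")
         .outputMode("append")
         .start())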
Recommended Answer
The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:
- For Datasets with no wide dependencies, you can control the input partitioning using reader-specific parameters.
- For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
- Independently of the lineage, you can coalesce or repartition (see the sketch after this list).
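A short sketch of those three knobs, assuming spark is an active SparkSession and df is the processed DataFrame; the configuration values, partition counts, and output path are illustrative assumptions:

# 1. Reader-specific parameters, e.g. capping the size of input splits
#    for file-based sources (shapes jobs with no wide dependencies).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

# 2. The shuffle partition count, which applies after any wide
#    dependency such as a join or groupBy.
spark.conf.set("spark.sql.shuffle.partitions", 10)

# 3. Explicitly reshaping the final Dataset, regardless of lineage.
merged = df.coalesce(1)       # collapse to one partition without a shuffle
spread = df.repartition(10)   # full shuffle into exactly 10 partitions

# For example, to write a single text file (assuming df has the one
# string column the text writer requires):
df.coalesce(1).write.mode("append").format("text").save("hdfs:///tmp/out")

Note that coalesce avoids a shuffle but can also collapse the preceding computation onto very few tasks, so repartition is often the safer choice when the target partition count is much smaller than the current one.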
is there a way to also limit the size of each file so a new file will be written when the current one reaches a certain size/number of rows?
No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
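To see that 1:1 mapping concretely, a small check (the path is illustrative, and df is again assumed to have a single string column):

df10 = df.repartition(10)
print(df10.rdd.getNumPartitions())   # -> 10
df10.write.mode("overwrite").format("text").save("hdfs:///tmp/ten_files")
# The output directory now holds exactly 10 part-* files (plus _SUCCESS).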