How can I control the number of output files written from Spark DataFrame?
Problem Description
Using Spark Streaming to read JSON data from a Kafka topic. I use a DataFrame to process the data, and later I want to save the output to HDFS files. The problem is that using:
df.write.mode("append").format("text").save(path)
yields many files; some are large, and some are even 0 bytes.
Is there a way to control the number of output files? Also, to avoid the "opposite" problem, is there a way to limit the size of each file, so that a new file is written once the current one reaches a certain size/number of rows?
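For context, here is a minimal sketch of this kind of pipeline using Structured Streaming; the broker address, topic name, and HDFS paths are placeholder assumptions, not details from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw records from Kafka (server and topic are hypothetical).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers bytes; the text sink needs a single string column.
lines = raw.select(col("value").cast("string"))

# Append each micro-batch to HDFS as text files; every partition of a
# batch becomes its own file, which is where the many-files problem starts.
query = (lines.writeStream
         .format("text")
         .option("path", "hdfs:///tmp/output")
         .option("checkpointLocation", "hdfs:///tmp/checkpoint")
         .outputMode("append")
         .start())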
Recommended Answer
The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:
- For Datasets with no wide dependencies, you can control the input partitioning using reader-specific parameters.
- For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
- Independently of the lineage, you can coalesce or repartition (see the sketch after this list).
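A short sketch of those three knobs, assuming spark is an active SparkSession and df is the processed DataFrame; the configuration values, partition counts, and output path are illustrative assumptions:

# 1. Reader-specific parameters, e.g. capping the size of input splits
#    for file-based sources (shapes jobs with no wide dependencies).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

# 2. The shuffle partition count, which applies after any wide
#    dependency such as a join or groupBy.
spark.conf.set("spark.sql.shuffle.partitions", 10)

# 3. Explicitly reshaping the final Dataset, regardless of lineage.
merged = df.coalesce(1)       # collapse to one partition without a shuffle
spread = df.repartition(10)   # full shuffle into exactly 10 partitions

# For example, to write a single text file (assuming df has the one
# string column the text writer requires):
df.coalesce(1).write.mode("append").format("text").save("hdfs:///tmp/out")

Note that coalesce avoids a shuffle but can also collapse the preceding computation onto very few tasks, so repartition is often the safer choice when the target partition count is much smaller than the current one.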
is there a way to also limit the size of each file so a new file will be written when the current one reaches a certain size/number of rows?
No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
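To see that 1:1 mapping concretely, a small check (the path is illustrative, and df is again assumed to have a single string column):

df10 = df.repartition(10)
print(df10.rdd.getNumPartitions())   # -> 10
df10.write.mode("overwrite").format("text").save("hdfs:///tmp/ten_files")
# The output directory now holds exactly 10 part-* files (plus _SUCCESS).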