Spark: How to Specify the Number of Resulting Files for a DataFrame While/After Writing


Problem Description


I saw several Q&As about writing a single file into HDFS, and it seems that using coalesce(1) is sufficient.

For example:

df.coalesce(1).write.mode("overwrite").format(format).save(location)

But how can I specify the "exact" number of files that will be written after the save operation?

So my questions are:

If I have a DataFrame consisting of 100 partitions, will the write operation produce 100 files?

If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50) before writing, will it write 50 files?

Is there a way in Spark to specify the resulting number of files when writing a DataFrame into HDFS?

Thanks

Solution

The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part, and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy is used.
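For illustration, here is a minimal Scala sketch of both cases: the partition count setting the part-file count, and partitionBy pushing the total higher. The SparkSession setup, the sample data, and the /tmp/out paths are assumptions added for this example, not part of the original answer.

  import org.apache.spark.sql.SparkSession

  // Minimal sketch -- session, data, and output paths are hypothetical.
  val spark = SparkSession.builder()
    .appName("file-count-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val df = (1 to 1000).toDF("id")

  // 100 partitions -> roughly 100 part files (empty partitions may be skipped).
  df.repartition(100).write.mode("overwrite").parquet("/tmp/out/hundred")

  // partitionBy splits each task's output by column value, so the total
  // number of files can exceed the number of writing tasks.
  df.withColumn("bucket", $"id" % 3)
    .repartition(10)
    .write.mode("overwrite")
    .partitionBy("bucket")
    .parquet("/tmp/out/bucketed")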

Normally

If I have a DataFrame consisting of 100 partitions, will the write operation produce 100 files?

Yes.

If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50) before writing, will it write 50 files?

And yes.

Is there a way in Spark to specify the resulting number of files when writing a DataFrame into HDFS?

No.
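Since the writer offers no setting for the final file count, the usual workaround is to fix the partition count immediately before the write. A minimal sketch, reusing the hypothetical df and output paths from the example above:

  // Hypothetical target; repartition shuffles, but balances data across files.
  val targetFiles = 50
  df.repartition(targetFiles)
    .write.mode("overwrite")
    .parquet("/tmp/out/fifty")

  // coalesce avoids the shuffle but can only lower the partition count,
  // and may leave file sizes skewed.
  df.coalesce(targetFiles)
    .write.mode("overwrite")
    .parquet("/tmp/out/fifty_coalesced")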

