Spark save (write) Parquet as only one file

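Problem description

If I write

dataFrame.write.format("parquet").mode("append").save("temp.parquet")

then in the temp.parquet folder I get as many files as there are rows.

I don't think I fully understand Parquet yet, but is this normal?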

Solution

Use coalesce before the write operation:

dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
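As background on why so many files appear: Spark generally writes one part file per non-empty partition of the DataFrame, so a DataFrame with as many partitions as rows ends up as roughly one file per row. Below is a minimal PySpark sketch for checking that count before writing. The helper name count_partitions is made up for illustration, and the result is only a rough guide, since write options can change the final file count.

```python
def count_partitions(data_frame):
    """Return the number of partitions backing a Spark DataFrame.

    Hypothetical helper for illustration; `data_frame` is assumed to be
    a pyspark.sql.DataFrame.
    """
    # DataFrame.rdd exposes the underlying RDD, which knows its partition count.
    return data_frame.rdd.getNumPartitions()
```

Calling count_partitions(dataFrame) before and after coalesce(1) should show the count dropping to 1.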


EDIT-1

Upon a closer look, the docs do warn about coalesce:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

Therefore, as suggested by @Amar, it's better to use repartition.
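The repartition variant of the write can be sketched as follows, assuming PySpark (the same method chain also works in Scala). The function name write_single_parquet is hypothetical and only wraps the call from the question. Unlike coalesce(1), repartition(1) adds a shuffle, so the stages before it can still run in parallel and only the final write happens in a single task.

```python
def write_single_parquet(data_frame, path):
    """Append `data_frame` to `path` as Parquet, written from one partition.

    Hypothetical helper for illustration; `data_frame` is assumed to be
    a pyspark.sql.DataFrame.
    """
    # repartition(1) shuffles all rows into a single partition before the write.
    (
        data_frame.repartition(1)
        .write.format("parquet")
        .mode("append")
        .save(path)
    )
```

It would be called as write_single_parquet(dataFrame, "temp.parquet"). Because the mode is append, each call adds a new part file to the folder rather than replacing the files already there.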
