Save a large Spark DataFrame as a single JSON file in S3

Published: 2021/11/14 22:16:49 | Tags: apache-spark, dataframe, apache-spark-sql, pyspark


I'm trying to save a Spark DataFrame (more than 20 GB) to a single JSON file in Amazon S3. My code to save the DataFrame looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3, "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark? Or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot.

I would try splitting the large DataFrame into a series of smaller DataFrames and then appending each of them to the same file in the target:

df.write.mode('append').json(yourtargetpath)
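
To make the splitting step concrete, here is a minimal sketch of how it might look in PySpark. It assumes Spark 1.4 or later, where `DataFrame.write` and `DataFrame.randomSplit` are available (they are not in the 1.3.1 release mentioned in the question). The helper names `chunks_needed` and `append_in_chunks` and the 4 GB budget are hypothetical, chosen only for illustration. Also keep in mind that Spark's file-based writers usually treat the target path as a directory, so append mode adds new part files under that path rather than extending one existing object, and the JSON size on disk may differ from any size estimate you start from.

```python
import math


def chunks_needed(total_bytes, max_chunk_bytes):
    """Return how many chunks keep each one at or below max_chunk_bytes.

    Both arguments are rough estimates supplied by the caller (hypothetical
    helper, not part of Spark).
    """
    return max(1, math.ceil(total_bytes / max_chunk_bytes))


def append_in_chunks(df, target_path, num_chunks):
    """Split df into num_chunks roughly equal pieces and append each one.

    Assumes df is a PySpark DataFrame (Spark 1.4+). randomSplit normalizes
    the weights, so equal weights give roughly equal-sized pieces.
    """
    weights = [1.0] * num_chunks
    written = 0
    for chunk in df.randomSplit(weights, seed=42):
        # Each write adds this chunk's output to whatever is already at
        # target_path instead of overwriting it.
        chunk.write.mode("append").json(target_path)
        written += 1
    return written
```

For the DataFrame in the question, something like `append_in_chunks(dataframe, "s3n://mybucket/testfile", chunks_needed(20 * 1024**3, 4 * 1024**3))` would write it in five pieces, each intended to stay under the 5 GB upload limit.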
