Save a large Spark DataFrame as a single JSON file in S3

Views: 262 | Posted: 2016/5/22 15:55:46 | Tags: apache-spark, dataframe, apache-spark-sql, pyspark


I'm trying to save a Spark DataFrame (more than 20 GB) to a single JSON file in Amazon S3. My code to save the DataFrame is:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum file size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark? Or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot.

I would try splitting the large DataFrame into a series of smaller DataFrames that you then append to the same file at the target.

df.write.mode('append').json(yourtargetpath)
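
A minimal sketch of how the splitting step might look, assuming Spark 1.4 or later (where `df.write` and `DataFrame.randomSplit` are available; the 1.3.1 API used in the question does not have `df.write`). The helper names `split_weights` and `append_in_chunks`, the size arguments, and the seed are hypothetical and only for illustration. Note that in append mode Spark adds new part files under the target path rather than extending one physical file, so this keeps each upload small but the output is a directory of part files.

```python
import math


def split_weights(total_bytes, max_chunk_bytes):
    """Return equal weights, one per chunk, so that each chunk is roughly
    max_chunk_bytes or smaller (randomSplit sizes are only approximate)."""
    if total_bytes <= 0 or max_chunk_bytes <= 0:
        raise ValueError("sizes must be positive")
    n_chunks = max(1, int(math.ceil(total_bytes / float(max_chunk_bytes))))
    return [1.0] * n_chunks


def append_in_chunks(df, target_path, total_bytes, max_chunk_bytes):
    """Split df into smaller DataFrames and append each one to target_path.

    total_bytes is the caller's estimate of the DataFrame's size as JSON;
    Spark does not report it, so it is an assumption of this sketch.
    """
    weights = split_weights(total_bytes, max_chunk_bytes)
    # randomSplit normalizes the weights and returns one DataFrame per weight.
    chunks = df.randomSplit(weights, seed=42)
    for chunk in chunks:
        # coalesce(1) makes each chunk land in a single part file.
        chunk.coalesce(1).write.mode('append').json(target_path)
    return len(chunks)
```

With the figures from the question, a call such as `append_in_chunks(dataframe, "s3n://mybucket/testfile", 20 * 1024**3, 4 * 1024**3)` would write five pieces of about 4 GB each. Because Spark writes one JSON object per line, the resulting part files can be concatenated afterwards if the downstream user needs a single file.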
