Writing a Spark dataframe to a .csv file in S3 and choosing a name in PySpark


Question

I have a dataframe and I am going to write it to a .csv file in S3. I use the following code:

df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)

It puts a .csv file in the product_profit_weekly folder, but at the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?

Answer

Spark dataframe writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called

dbfs:/mnt/mount1/2016//product_profit_weekly

and one file inside called

part-00000

In this case, you are doing something that can be quite inefficient and not very "sparky": you are coalescing all dataframe partitions into one, which means the write isn't actually executed in parallel.

Here's a different model. To take advantage of all of Spark's parallelization, don't coalesce; write in parallel to some directory.
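
For example (just a sketch, reusing the output path from the question), the parallel version is the same write without the coalesce:

# Parallel write: every partition writes its own part file into the target directory.
df.write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly", mode='overwrite', header=True)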

If you have 100 partitions, you will get:

part-00000
part-00001
...
part-00099

If you need everything in one flat file, write a little function to merge it after the fact. You could do this in Scala, or in bash with:

cat ${dir}/part-* > ${flatFilePath}
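
Alternatively, if you keep the coalesce(1) approach and just want a predictable file name, you can rename the single part file after the write. The sketch below assumes a Databricks environment (so dbutils is available) and that exactly one part file was produced; the output directory and the target name product_profit_weekly.csv are only illustrative:

# Locate the single part file Spark wrote, copy it to the desired name,
# then remove the original. Paths and the target file name are illustrative.
output_dir = "dbfs:/mnt/mount1/2016//product_profit_weekly"
part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "dbfs:/mnt/mount1/2016/product_profit_weekly.csv")
dbutils.fs.rm(part_file)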
