Writing a Spark dataframe to a .csv file in S3 and choosing a file name in PySpark
Question
I have a dataframe that I want to write to a .csv file in S3. I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder, but at the moment the .csv file has a strange name in S3. Is it possible to choose a file name when writing it?
Answer
None of the Spark dataframe writers (df.write.___) write to a single file; they write one chunk per partition. What you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and a single file inside it called
part-00000
In this case, you are doing something that can be quite inefficient and not very "sparky": by coalescing all the dataframe partitions into one, you ensure that your task is not actually executed in parallel!
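That said, if you do stick with coalesce(1), one way to get a chosen file name is to rename the single part file after the write finishes. This is a hedged sketch, not part of the original answer: it assumes the output directory is reachable as an ordinary filesystem path (on Databricks, dbfs:/ mounts appear under /dbfs/, or you would use dbutils.fs.mv instead), and the function name is illustrative.

```python
import glob
import os


def rename_single_part(out_dir, target_path):
    """Find the lone part-* file Spark wrote into out_dir and rename it.

    Assumes the directory was written with coalesce(1), so exactly one
    part file exists. Raises if that assumption does not hold.
    """
    parts = glob.glob(os.path.join(out_dir, "part-*"))
    if len(parts) != 1:
        raise RuntimeError(
            "expected exactly one part file, found %d" % len(parts)
        )
    os.rename(parts[0], target_path)
    return target_path
```

Spark also drops marker files such as _SUCCESS into the directory; the part-* glob deliberately skips those.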
Here's a different model. To take advantage of Spark's parallelism, don't coalesce; write in parallel to some directory instead.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a small function to merge it after the fact. You could do this in Scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
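The bash one-liner above can also be sketched in Python. This is an illustrative version (the names merge_parts and flat_file_path are not from the original answer), and it assumes the output directory is reachable as a local filesystem path:

```python
import glob
import os
import shutil


def merge_parts(out_dir, flat_file_path):
    """Concatenate all part-* files in a Spark output directory,
    in sorted order, into a single flat file."""
    parts = sorted(glob.glob(os.path.join(out_dir, "part-*")))
    with open(flat_file_path, "wb") as dest:
        for p in parts:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, dest)
    return flat_file_path
```

One caveat: if you wrote with header=True, every part file carries its own header row, so the naive concatenation repeats the header. Either write without headers, or extend the function to skip the first line of every part after the first.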