Writing a Spark dataframe to a .csv file in S3 and choosing a file name in PySpark
Question
I have a dataframe that I want to write to a .csv file in S3. I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder, but at the moment the .csv file has a strange name in S3. Is it possible to choose a file name when writing it?
Answer
None of the Spark dataframe writers (df.write.___) write to a single file; they write one chunk per partition. What you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and a single file inside it called
part-00000
In this case, you are doing something that can be quite inefficient and not very "sparky": by coalescing all the dataframe partitions into one, you ensure that your task is not actually executed in parallel!
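That said, if you do stick with coalesce(1), one way to get a chosen file name is to rename the single part file after the write finishes. This is a hedged sketch, not part of the original answer: it assumes the output directory is reachable as an ordinary filesystem path (on Databricks, dbfs:/ mounts appear under /dbfs/, or you would use dbutils.fs.mv instead), and the function name is illustrative.

```python
import glob
import os


def rename_single_part(out_dir, target_path):
    """Find the lone part-* file Spark wrote into out_dir and rename it.

    Assumes the directory was written with coalesce(1), so exactly one
    part file exists. Raises if that assumption does not hold.
    """
    parts = glob.glob(os.path.join(out_dir, "part-*"))
    if len(parts) != 1:
        raise RuntimeError(
            "expected exactly one part file, found %d" % len(parts)
        )
    os.rename(parts[0], target_path)
    return target_path
```

Spark also drops marker files such as _SUCCESS into the directory; the part-* glob deliberately skips those.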
Here's a different model. To take advantage of Spark's parallelism, don't coalesce; write in parallel to some directory instead.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a small function to merge it after the fact. You could do this in Scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
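The bash one-liner above can also be sketched in Python. This is an illustrative version (the names merge_parts and flat_file_path are not from the original answer), and it assumes the output directory is reachable as a local filesystem path:

```python
import glob
import os
import shutil


def merge_parts(out_dir, flat_file_path):
    """Concatenate all part-* files in a Spark output directory,
    in sorted order, into a single flat file."""
    parts = sorted(glob.glob(os.path.join(out_dir, "part-*")))
    with open(flat_file_path, "wb") as dest:
        for p in parts:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, dest)
    return flat_file_path
```

One caveat: if you wrote with header=True, every part file carries its own header row, so the naive concatenation repeats the header. Either write without headers, or extend the function to skip the first line of every part after the first.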