Renaming spark output csv in azure blob storage
Question
I have a Databricks notebook setup that works as follows:
- pyspark connection details to the Blob storage account
- the file is read via a spark dataframe
- converted to a pandas Df
- data modeling on the pandas Df
- converted back to a spark Df
- written to blob storage as a single file
My problem is that you cannot name the output file, and I need a static csv filename.

Is there a way to rename this in pyspark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
Output file after data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
This then writes the file as "part-00000-tid-336943946930983.....csv"

The goal is to have "Output.csv"
Another approach I looked at was just recreating this in python, but I have not yet found in the documentation how to output the file back to blob storage.
I know the method to retrieve a file from Blob storage is .get_blob_to_path, via microsoft.docs.
Any help is greatly appreciated.
Answer
Hadoop/Spark writes the compute result of each partition in parallel to its own file, so you will see many part-<number>-.... files in the HDFS output path (such as the Output/ directory) that you named.
If you want all results of a computation in one file, you can either merge the part files afterwards with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or set the number of reduce processes to 1, e.g. by using the coalesce(1) function.
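What -getmerge does can be illustrated with a small local stand-in (plain Python, not the Hadoop tool itself, and assuming the part files sit in a locally accessible directory): concatenate every part-* file in an output directory into a single Output.csv.

```python
import glob
import os
import shutil

def merge_part_files(output_dir, merged_path):
    """Concatenate all part-* files in output_dir into one file,
    mimicking `hadoop fs -getmerge` for a local directory."""
    # Sort so parts are appended in order: part-00000, part-00001, ...
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    return merged_path
```

Note that with header=True each Spark part file carries its own header row, so a plain concatenation would repeat the header; -getmerge has the same caveat.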
So in your scenario you only need to make sure coalesce is called before the save function, as below. Note that coalesce is a method on the DataFrame, not on the DataFrameWriter, so it must come before .write:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
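Even with coalesce(1), the single file Spark writes still gets a part-... name rather than a static one. A common follow-up step, not part of the original answer, is to rename that part file afterwards; on Databricks this is typically done with dbutils.fs.mv, but the same idea on a locally mounted output path can be sketched as:

```python
import glob
import os

def rename_part_file(output_dir, final_name="Output.csv"):
    """Find the single part-* file Spark wrote into output_dir and
    rename it to a static filename. Assumes coalesce(1) was used,
    so exactly one part file exists."""
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    if len(parts) != 1:
        raise RuntimeError("expected exactly one part file, found %d" % len(parts))
    final_path = os.path.join(output_dir, final_name)
    os.rename(parts[0], final_path)
    return final_path
```

The function name and the local-filesystem rename are illustrative assumptions; against blob storage the rename would go through dbutils.fs or the Hadoop FileSystem API instead of os.rename.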