Renaming Spark output CSV in Azure Blob Storage


Question

I have a Databricks notebook setup that works as follows:

  • pyspark connection details to the Blob storage account
  • Read the file via a Spark dataframe
  • Convert to a pandas Df
  • Data modelling on the pandas Df
  • Convert back to a Spark Df
  • Write to blob storage as a single file (see the sketch below)
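
A minimal sketch of that flow, pieced together from the snippets further down; the read options and the modelling step are placeholders, not the actual notebook code:

## Illustrative skeleton of the notebook flow (file_location and
## file_location_new are defined in the configuration snippet below)
df = spark.read.format("csv").option("header", "true").load(file_location)

pdf = df.toPandas()                    ## Spark Df -> pandas Df
## ... data modelling on pdf ...
df_out = spark.createDataFrame(pdf)    ## pandas Df -> Spark Df

## coalesce(1) collapses the result into a single partition before writing
df_out.coalesce(1).write.format("csv") \
  .mode("overwrite").option("header", "true").save(file_location_new)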

My problem is that you cannot name the output file, and I need a static CSV filename.

Is there a way to rename this in pyspark?

## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""

## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"

## Connection string to connect to blob storage
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)

Output file after the data transformation:

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)

This then writes the file as "part-00000-tid-336943946930983.....csv".

The goal is to have "Output.csv".

Another approach I looked at was just recreating this in Python, but I have not yet come across anything in the documentation on how to output the file back to Blob storage.

I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs.
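
For that pure-Python route, the upload counterpart of get_blob_to_path in the same legacy azure-storage SDK would be create_blob_from_path; a minimal sketch, assuming that SDK version, with illustrative container, blob, and local-file names:

from azure.storage.blob import BlockBlobService

## Legacy azure-storage SDK (the one exposing get_blob_to_path);
## the container/blob/local-path values here are placeholders
blob_service = BlockBlobService(account_name=storage_account_name,
                                account_key=storage_account_access_key)
blob_service.create_blob_from_path("databricks-files", "out/Output.csv",
                                   "/tmp/Output.csv")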

Any help is greatly appreciated.

Answer

Hadoop/Spark writes the computed result of each partition to its own file in parallel, so you will see many part-<number>-.... files in an HDFS output path such as the Output/ directory you named.

If you want all the results of a computation in a single file, you can either merge the part files with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the output to one partition before writing, e.g. with the coalesce(1) function.

So in your scenario, you just need to keep the coalesce(1) call on the DataFrame itself, ahead of .write and save (it cannot be chained after .option, because a DataFrameWriter has no coalesce method), as below.

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)
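
Note that coalesce(1) still leaves the single file named part-00000-....csv. To end up with the literal Output.csv on Databricks, one option (my own suggestion, not part of the original answer) is to rename the part file afterwards with the Databricks-specific dbutils.fs utilities:

## Hypothetical follow-up: rename the single part file to a static name.
## file_location_new is the save path used above.
part_file = [f.path for f in dbutils.fs.ls(file_location_new)
             if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, file_location_new + "/Output.csv")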

