Renaming Spark output CSV in Azure Blob Storage


Question

I have a Databricks notebook setup that works as follows:

  • pyspark connection details to the Blob storage account
  • Read the file via a Spark dataframe
  • Convert to a pandas Df
  • Data modelling on the pandas Df
  • Convert back to a Spark Df
  • Write to blob storage as a single file (see the sketch below)
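
A minimal sketch of that flow, pieced together from the snippets further down; the read options and the modelling step are placeholders, not the actual notebook code:

## Illustrative skeleton of the notebook flow (file_location and
## file_location_new are defined in the configuration snippet below)
df = spark.read.format("csv").option("header", "true").load(file_location)

pdf = df.toPandas()                    ## Spark Df -> pandas Df
## ... data modelling on pdf ...
df_out = spark.createDataFrame(pdf)    ## pandas Df -> Spark Df

## coalesce(1) collapses the result into a single partition before writing
df_out.coalesce(1).write.format("csv") \
  .mode("overwrite").option("header", "true").save(file_location_new)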

My problem is that you cannot name the output file, and I need a static CSV filename.

Is there a way to rename this in pyspark?

## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""

## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"

## Connection string to connect to blob storage
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)

Output file after the data transformation:

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)

This then writes the file as "part-00000-tid-336943946930983.....csv".

The goal is to have "Output.csv".

Another approach I looked at was just recreating this in Python, but I have not yet come across anything in the documentation on how to output the file back to Blob storage.

I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs.
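
For that pure-Python route, the upload counterpart of get_blob_to_path in the same legacy azure-storage SDK would be create_blob_from_path; a minimal sketch, assuming that SDK version, with illustrative container, blob, and local-file names:

from azure.storage.blob import BlockBlobService

## Legacy azure-storage SDK (the one exposing get_blob_to_path);
## the container/blob/local-path values here are placeholders
blob_service = BlockBlobService(account_name=storage_account_name,
                                account_key=storage_account_access_key)
blob_service.create_blob_from_path("databricks-files", "out/Output.csv",
                                   "/tmp/Output.csv")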

Any help is greatly appreciated.

Answer

Hadoop/Spark writes the computed result of each partition to its own file in parallel, so you will see many part-<number>-.... files in an HDFS output path such as the Output/ directory you named.

If you want all the results of a computation in a single file, you can either merge the part files with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the output to one partition before writing, e.g. with the coalesce(1) function.

So in your scenario, you just need to keep the coalesce(1) call on the DataFrame itself, ahead of .write and save (it cannot be chained after .option, because a DataFrameWriter has no coalesce method), as below.

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)
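
Note that coalesce(1) still leaves the single file named part-00000-....csv. To end up with the literal Output.csv on Databricks, one option (my own suggestion, not part of the original answer) is to rename the part file afterwards with the Databricks-specific dbutils.fs utilities:

## Hypothetical follow-up: rename the single part file to a static name.
## file_location_new is the save path used above.
part_file = [f.path for f in dbutils.fs.ls(file_location_new)
             if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, file_location_new + "/Output.csv")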

