Azure Data Factory deflate without creating a folder

Problem Description

I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.

There is a mix of .csv files and .zip files (each containing only one csv file).

I have one dataset for copying the csv files and another for copying the zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder containing the csv file, and I need the copy to respect the folder hierarchy without creating any folders.
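
For reference, the zip dataset is configured along these lines (a minimal sketch, not the exact definition; it mirrors the dataset JSON as a Python dict, and the names and folder path are placeholders):

    # Sketch of a Binary dataset that asks the copy activity to unzip on read.
    # "ZipFilesOnSftp", "SftpFiles" and "incoming" are illustrative placeholders.
    zip_dataset = {
        "name": "ZipFilesOnSftp",
        "properties": {
            "type": "Binary",
            "linkedServiceName": {
                "referenceName": "SftpFiles",
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {"type": "SftpLocation", "folderPath": "incoming"},
                # ZipDeflate makes the copy activity decompress the archive;
                # the extracted csv then lands in a new folder on the sink.
                "compression": {"type": "ZipDeflate"},
            },
        },
    }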

Is this possible in Azure Data Factory?

Recommended Answer

Good question, I ran into similar trouble* and it doesn't seem to be well documented.

If I remember correctly, Data Factory assumes a ZipDeflate source could contain more than one file, and it appears to create a folder no matter what.

If, on the other hand, you have Gzip files (which can only hold a single file), then it will create just that file.

You probably already know this bit, but keeping it at the forefront of your mind helped me see why Data Factory's default is sensible:

My understanding is that the Zip standard is an archive format which happens to use the Deflate algorithm. Being an archive format, it can naturally contain multiple files.

Whereas gzip, for example, is just the compression algorithm; it doesn't support multiple files (unless they are tar-archived first), so it decompresses to a single file without a folder.
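
The difference is easy to see with both formats side by side; a quick standard-library Python illustration (filenames and contents are made up):

    import gzip
    import io
    import zipfile

    # Zip is an archive: it stores named members, so one .zip can hold many files.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.csv", "id,name\n1,x\n")
        zf.writestr("b.csv", "id,name\n2,y\n")
    with zipfile.ZipFile(buf) as zf:
        print(zf.namelist())  # ['a.csv', 'b.csv'] -- members need somewhere to go

    # Gzip is just a compressed stream: one payload, no internal file names,
    # so there is nothing that needs a folder when it is decompressed.
    blob = gzip.compress(b"id,name\n1,x\n")
    print(gzip.decompress(blob))

Data Factory's folder-per-zip behaviour mirrors that: the (potentially many) extracted members have to be put somewhere.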

You could perhaps add an additional Data Factory step to take the hierarchy and copy it to a flat folder, but that leads to random file names (which you may or may not be happy with). For us that didn't work, as the next step in our pipeline needed predictable filenames.
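
If you do try the flattening copy, it is a sink-side setting on the copy activity; a hedged sketch of the relevant fragment (again written as a Python dict of the activity JSON):

    # "FlattenHierarchy" writes every source file straight into the target folder
    # with autogenerated names -- which is exactly where the random file names
    # mentioned above come from.
    flatten_sink = {
        "type": "AzureBlobFSSink",  # the Data Lake Gen2 sink type
        "copyBehavior": "FlattenHierarchy",
    }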

N.B. Data Factory does not move files, it copies them, so if they are very large this could be a pain. However, you can trigger a metadata move operation via the Data Lake Store API or PowerShell, etc.
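
For example, with the ADLS Gen2 SDK a rename is effectively a free "move" (a minimal sketch assuming the azure-storage-file-datalake Python package; the connection string, filesystem and paths are placeholders):

    from azure.storage.filedatalake import DataLakeServiceClient

    CONN_STR = "<your storage account connection string>"  # placeholder

    service = DataLakeServiceClient.from_connection_string(CONN_STR)
    fs = service.get_file_system_client("myfilesystem")

    # On a hierarchical-namespace (Gen2) account a rename is a metadata
    # operation, so the file's bytes are not copied or rewritten.
    file_client = fs.get_file_client("unzipped/archive1/data.csv")
    file_client.rename_file("myfilesystem/flat/data.csv")  # new name includes the filesystem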

*Mine was a slightly crazier situation, in that I was receiving files named .gz from a source system which were in fact zip files in disguise! In the end the best option was to ask our source system to change to true gzip files.
