Azure Data Lake Storage and Data Factory - Temporary GUID folders and files

Problem description

I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes into ADLS. During execution I notice that there is a folder created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and many files in it, also GUIDs. The folder is temporary and after around 30 seconds it disappears.

Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? Although it appears in the Data Explorer in the portal, does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.

Any insight appreciated - a Google turned up little.

Answer

So what you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.

For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory for each thread of the parallel operation. Once complete, the process concatenates the threads' output into the single expected destination file.
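
To make that mechanism concrete, here is a toy, local-only sketch of the split / parallel-write / concatenate pattern. It is not ADLS's actual implementation (the chunk size, thread count and temp-folder naming are invented for the example); it just shows why a GUID-named folder full of part files exists briefly before the single destination file appears.

    # Toy illustration only: split a file, write the parts in parallel under
    # a GUID-named temp folder, then concatenate them into the destination
    # file. This is NOT how ADLS implements it internally.
    import os
    import shutil
    import uuid
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 4 * 1024 * 1024  # 4 MB parts, purely for the toy example

    def write_part(temp_dir, index, data):
        # Write one chunk to its own part file and report where it went.
        part_path = os.path.join(temp_dir, f"{uuid.uuid4()}.part{index}")
        with open(part_path, "wb") as fh:
            fh.write(data)
        return index, part_path

    def parallel_copy(src, dst):
        # Temp folder with a GUID name, next to the destination file.
        temp_dir = os.path.join(os.path.dirname(dst) or ".", str(uuid.uuid4()))
        os.makedirs(temp_dir)
        try:
            with open(src, "rb") as fh, ThreadPoolExecutor(max_workers=8) as pool:
                futures = [
                    pool.submit(write_part, temp_dir, i, data)
                    for i, data in enumerate(iter(lambda: fh.read(CHUNK), b""))
                ]
                parts = sorted(fut.result() for fut in futures)
            # Concatenate the parts, in order, into the single expected file.
            with open(dst, "wb") as out:
                for _, part_path in parts:
                    with open(part_path, "rb") as part:
                        shutil.copyfileobj(part, out)
        finally:
            # The GUID folder disappears once the copy completes (or fails).
            shutil.rmtree(temp_dir, ignore_errors=True)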

Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.

I understand your concerns here. I've also done battle with this, whereby the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
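
As one way to act on that, a downstream step could verify that the exact expected output file exists and sweep out any leftover GUID-named folders before reading anything. This is only a sketch using the azure-datalake-store Python SDK against ADLS Gen1; the store name, credentials, /output path and file name are placeholders, and whether to delete leftovers rather than just flag them is your call.

    # Pre-flight check for a downstream consumer: confirm the expected file
    # is there and deal with any GUID-named temp folders left behind by a
    # failed copy. Store name, credentials and paths are placeholders.
    import re
    from azure.datalake.store import core, lib

    GUID_RE = re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)

    def preflight(adl, folder, expected_file):
        target = folder.rstrip("/") + "/" + expected_file
        if not adl.exists(target):
            raise FileNotFoundError("expected output %s is missing" % target)
        for child in adl.ls(folder, detail=True):
            name = child["name"].rsplit("/", 1)[-1]
            if child["type"] == "DIRECTORY" and GUID_RE.match(name):
                print("removing leftover temp folder %s" % child["name"])
                adl.rm(child["name"], recursive=True)

    token = lib.auth(tenant_id="<tenant-id>", client_id="<client-id>",
                     client_secret="<client-secret>")
    adl = core.AzureDLFileSystem(token, store_name="<adls-account>")
    preflight(adl, "/output", "mydata.csv")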

One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4 - 8 GB region. Be warned!

Side note. I've found PowerShell most reliable for handling uploads into Data Lake Store.
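
If you would rather script the upload from Python than PowerShell, the azure-datalake-store SDK's multithreaded uploader covers the same ground. A minimal sketch, with the account name, credentials and paths as placeholders:

    # Minimal multithreaded upload to ADLS Gen1 using the azure-datalake-store
    # Python SDK. Account name, credentials and paths are placeholders.
    from azure.datalake.store import core, lib, multithread

    token = lib.auth(tenant_id="<tenant-id>", client_id="<client-id>",
                     client_secret="<client-secret>")
    adl = core.AzureDLFileSystem(token, store_name="<adls-account>")

    # nthreads controls how many parallel writers run; the SDK splits the
    # local file and assembles the single remote file for you.
    multithread.ADLUploader(adl, lpath="bigfile.csv",
                            rpath="/output/bigfile.csv",
                            nthreads=8, overwrite=True)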

Hope this helps.
