Data Lake Analytics U-SQL EXTRACT speed (Local vs Azure)


Problem description

I've been looking into using Azure Data Lake Analytics to try and manipulate some Gzip'd XML data I have stored in Azure Blob Storage, but I'm running into an interesting issue. Essentially, when using U-SQL locally to process 500 of these XML files, the processing time is extremely quick: roughly 40 seconds using 1 AU locally (which appears to be the limit). However, when we run the same job inside Azure using 5 AUs, the processing takes 17+ minutes.

We eventually want to scale this up to ~20,000 files and beyond, but have reduced the set to try and measure the speed.

Each file contains a collection of 50 XML objects (with varying amounts of detail in the child elements); the files are roughly 1 MB when Gzip'd and between 5 MB and 10 MB uncompressed. 99% of the processing time is spent in the EXTRACT section of the U-SQL script.
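For context, a minimal sketch of what such an EXTRACT might look like is below; the question never shows the actual script, so the input path and column names are assumptions, and Extractors.Text here merely stands in for whatever XML extractor the real script uses. Two details are standard U-SQL behaviour: *.gz inputs are decompressed automatically based on the file extension, and the {FileName} file set pattern lets a single EXTRACT read all 500 files.

// Illustrative sketch only: the path and columns are assumed, and a plain
// text extractor stands in for the asker's (unshown) XML extractor.
@lines =
    EXTRACT Line     string,   // one row per line of decompressed text
            FileName string    // virtual column bound by the {FileName} file set
    FROM "/input/{FileName}.xml.gz"
    USING Extractors.Text(delimiter : '\n', quoting : false);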

Things I've tried

- Unzipped the files before processing; this took roughly the same time as the zipped version, certainly nowhere near the 40 seconds I was seeing locally.
- Moved the data from Blob Storage to Azure Data Lake Store; this took exactly the same length of time.
- Temporarily removed about half of the data from the files and re-ran; surprisingly, this didn't shave more than a minute off either.
- Added more AUs to cut the processing time; this worked extremely well, but isn't a long-term solution due to the costs that would be incurred.

It seems to me as if there is a major bottleneck when getting the data from Azure Blob Storage/Azure Data Lake. Am I missing something obvious?

P.S. Let me know if you need any more information.

Thanks,

Nick.

Answer

See slide 31 of https://www.slideshare.net/MichaelRys/best-practices-and-performance-tuning-of-usql-in-azure-data-lake-sql-konferenz-2018. There is a preview option:

SET @@FeaturePreviews="InputFileGrouping:on";

which groups small files into a limited number of vertices.
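The reason this flag matters for this workload: U-SQL normally assigns each input file at least one extract vertex, so with hundreds (and eventually ~20,000) of ~1 MB files the job spends most of its wall-clock time on per-vertex scheduling rather than extraction; grouping packs many small files into each vertex. A sketch of how the flag would sit in the script, reusing the assumed extract from above (the SET goes at the top of the script):

// Preview flag from the answer: pack many small input files into each
// vertex instead of scheduling one vertex per ~1 MB file.
SET @@FeaturePreviews = "InputFileGrouping:on";

@lines =
    EXTRACT Line     string,
            FileName string    // virtual file set column, as in the sketch above
    FROM "/input/{FileName}.xml.gz"
    USING Extractors.Text(delimiter : '\n', quoting : false);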
