General Data Processing - slowness with big data in the Azure DW. Looking for thoughts and how you tackle it


Problem Description

I came into a company that had this setup:

Every day we download 60 Million records (stored in a few hundred csv files) via an sFTP server and store them in blob storage.

Then we move the downloads to the Data Lake Store and use U-SQL and Azure Data Lake Analytics to separate that data and store it in external tables (as I see them from the Azure SQL DW).  I typically use about 20 Analytics Units and am able to zip through those 60 million records in about 5-10 minutes - nice.
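
For context, on the SQL DW side those external tables are ordinary PolyBase definitions over the lake, roughly like the sketch below (the data source, credential, locations, and the three columns are placeholders, not our real definitions):

-- Sketch only: every name, the credential, and the column list are placeholders.
-- ADLSCredential is assumed to be an existing database scoped credential for a
-- service principal with access to the Data Lake Store.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH ( TYPE = HADOOP,
       LOCATION = 'adl://<mydatalake>.azuredatalakestore.net',
       CREDENTIAL = ADLSCredential );

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',' ) );

CREATE EXTERNAL TABLE NewDaily60MMRecs
( RecordId BIGINT, DateKey INT, SomeMeasure DECIMAL(18,2) )
WITH ( LOCATION = '/output/daily/',
       DATA_SOURCE = AzureDataLakeStore,
       FILE_FORMAT = CsvFileFormat );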

The final step is the processing in the Azure Data Warehouse, and that is where I have my question:

I take the data (those 60 million records) from the external tables and bring them into a fact table.  The way they are brought into a fact table is via a "Create Table As Select" with a UNION so that duplicates are removed, then a rename and a drop.  For example, some pseudo code:

CREATE TABLE NewFullFactTable
WITH ( DISTRIBUTION = HASH(RecordId),      -- RecordId is just a placeholder distribution column
       CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM NewDaily60MMRecs             -- today's 60MM records from the external table
UNION                                      -- UNION (not UNION ALL) removes the duplicates
SELECT * FROM FullFactTable;               -- the existing full fact table

RENAME OBJECT FullFactTable TO OldFactTable;
RENAME OBJECT NewFullFactTable TO FullFactTable;
DROP TABLE OldFactTable;

The problem is that we have 60 million records every day and I am currently back-loading 2 years' worth of data.  Those 2 years will come to about 43 BILLION records.

Well, if you look at what we are doing, we are essentially recreating the main table EVERY DAILY RUN.  I just started back-loading this data 1 month at a time, but look at the daily run and at the increasing time it takes to run each month.  It's very worrying that I won't even be able to get through the first year without 6,000 DWUs and running for half a day. Here's what the times look like just for the "Create Table As" pattern above:

Month 1: FactRowsBeforeLoad: 0
Month 1: FactRowsAfterLoad:  1.8 Billion
Month 1: DWUs Allocated:  2,000
Month 1: Runtime:  About 30 minutes
Month 2: FactRowsBeforeLoad: 1.8 Billion
Month 2: FactRowsAfterLoad:  3.7 Billion
Month 2: DWUs Allocated:  2,000
Month 2: Runtime:  About 45 minutes
Month 3: FactRowsBeforeLoad: 3.7 Billion
Month 3: FactRowsAfterLoad:  4.8 Billion
Month 3: DWUs Allocated:  2,000
Month 3: Runtime:  About 1 hour 38 minutes
Month 4: FactRowsBeforeLoad: 4.8 Billion
Month 4: FactRowsAfterLoad:  6.2 Billion
Month 4: DWUs Allocated:  3,000
Month 4: Runtime:  About 1 hour 54 minutes
------------------------------------------------------

1 Day in Month 5: FactRowsBeforeLoad: 6.20 Billion
1 Day in Month 5: FactRowsAfterLoad:  6.27 Billion
1 Day in Month 5: DWUs Allocated:  3,000
1 Day in Month 5: Runtime:  About 1 hour 30 minutes

Am I doing something fundamentally wrong?  Should I be doing something else entirely?  We inherited this from a consultant and I came in as a new employee and am trying to get this up and running.

Even if you don't have answers, I would appreciate your thoughts on the process, both positive and critical.

Matt



Solution

Looking through your use case, the main issue I see is that you are rewriting the data in the data warehouse every time you load. Every month you rewrite the entire table, and that cost accumulates as the table gets larger and larger: each load has to write out the whole table as it stood, so the total rows written is the sum of all the running totals rather than just the final row count. To illustrate, I created a quick Excel sheet using the data provided. By the time two years are loaded, the current method moves 456 billion records to load 36 billion. That will hurt performance.

To get around this, you can do one of two things, depending on the particulars of your solution:

  •  Load all months/years, then do union all. 

I would load each month into SQL DW as its own staging (temporary) table. Then do the UNION ALL across all of those tables and write the result into the production table. This would cause you to rewrite the data only twice: once into staging and once into the production table.
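
Roughly, the pattern is something like the sketch below (the stg/ext table names and the RecordId distribution column are placeholders, not your real objects):

-- Sketch only: stage each month once from its external table...
CREATE TABLE stg.Fact_Month01
WITH ( DISTRIBUTION = HASH(RecordId), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM ext.Fact_Month01;

-- ...repeat per month, then write the production table in a single pass.
CREATE TABLE dbo.FullFactTable
WITH ( DISTRIBUTION = HASH(RecordId), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM stg.Fact_Month01
UNION ALL
SELECT * FROM stg.Fact_Month02
-- ...one SELECT per staged month...
UNION ALL
SELECT * FROM stg.Fact_Month24;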

  • Load all months/years, then partition switch into production table

A more advanced loading plan is to use partition switching to move data into the production table. Partition switching results in writing the data only once, into the production table's schema; the "write" into the production table itself is then just a metadata operation. There is more information about this pattern here.
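
A minimal sketch of the switch step, assuming the staging and production tables are partitioned identically on an integer DateKey (all names and boundary values here are placeholders):

-- Sketch only: the staging table must match the production table's schema,
-- distribution, and partition boundaries (RANGE RIGHT on DateKey in this example).
CREATE TABLE stg.Fact_Month01
WITH ( DISTRIBUTION = HASH(RecordId),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( DateKey RANGE RIGHT FOR VALUES (20170101, 20170201) ) )
AS SELECT * FROM ext.Fact_Month01;

-- Partition 2 holds 20170101 <= DateKey < 20170201. The matching partition in the
-- production table must be empty; the switch itself is a metadata-only operation.
ALTER TABLE stg.Fact_Month01 SWITCH PARTITION 2 TO dbo.FullFactTable PARTITION 2;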

Something to also accept is that the data movement will take time. At DWU 2000, with your data, you are looking at approximately 12 hours of load time for all of your data. There may be optimizations that could be made at the file level to increase performance. What format is your data written in?

hope that helps so far. 

Casey

