Reasons to use Azure Data Lake Analytics vs Traditional ETL approach

Question

I'm considering using the Data Lake technologies I have been studying for the last few weeks, compared with the traditional ETL SSIS scenarios I have been working with for so many years.

I think of Data Lake as something closely linked to big data, but where is the line between using Data Lake technologies vs SSIS?

Is there any advantage to using Data Lake technologies with 25MB ~ 100MB ~ 300MB files? Parallelism? Flexibility? Extensibility in the future? Is there any performance gain when the files to be loaded are not as big as U-SQL's best-case scenario?

What are your thoughts? Would it be like using a hammer to crack a nut? Please don't hesitate to ask me any questions to clarify the situation. Thanks in advance!

Edit 21/03 - more clarifications:

  1. It has to be on the cloud.
  2. The reason I considered ADL is that there is no substitute for SSIS in the cloud. There is ADF, but it's not the same: it orchestrates the data, but it's not as flexible as SSIS.
  3. I thought I could use U-SQL for some (basic) transformations (a minimal sketch of what I mean is shown after this list), but I see some problems:
    • There are many basic things I cannot do: loops, updates, writing logs to a SQL table...
    • The output can only be a U-SQL table or a file. The architecture doesn't look good this way (even though U-SQL is very good with big files) if I need an extra step to export the file to another DB or DWH - or maybe this is the way it's done in big data warehouses... I don't know.
    • In my tests, it takes 40s for a 1MB file and 1m15s for a 500MB file. I cannot justify a 40s process for 1MB (plus uploading to the database/data warehouse with ADF).
    • The code looks disorganised to a user, because scripts with many basic validations end up as very long U-SQL scripts.
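
For reference, this is the kind of basic U-SQL transformation I mean: a minimal sketch (file paths, columns and the aggregation are made up for illustration) that extracts from a file, transforms, and outputs to a file, since a file or a U-SQL table are the only sinks available.

    // Minimal U-SQL sketch (hypothetical paths and columns).
    @sales =
        EXTRACT CustomerId int,
                Amount decimal,
                EventDate DateTime
        FROM "/input/sales.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    @summary =
        SELECT CustomerId,
               SUM(Amount) AS TotalAmount
        FROM @sales
        WHERE EventDate >= DateTime.Parse("2017-01-01")
        GROUP BY CustomerId;

    // The only possible outputs: a file (as here) or a U-SQL catalog table.
    OUTPUT @summary
    TO "/output/sales_summary.csv"
    USING Outputters.Csv(outputHeader: true);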

Don't get me wrong, I really like the ADL technologies, but I think that for now they are for something very specific, and there is still no substitute for SSIS in the cloud. What do you think? Am I wrong?

Answer

For me, if the data is highly structured and relational, the right place for it is a relational database. In Azure you have several choices:

  1. SQL Server on a VM (IaaS): ordinary SQL Server running on a VM. You have to install, configure and manage it yourself, but you get the full flexibility of the product.
  2. Azure SQL Database: a PaaS database option targeted at smaller volumes but now up to 4TB. All of the features of normal SQL Server, with potentially lower TCO and the option to scale up or down using tiers.
  3. Azure SQL Data Warehouse (ADW): an MPP product suitable for large warehouses. For me, the entry criterion is a warehouse at least 1TB in size, and probably more like 10TB. It's really not worth having an MPP product for small volumes.

For all of the database options you can use clustered columnstore indexes (the default in ADW), which can give massive compression, between 5x and 10x.
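
For example, a minimal T-SQL sketch (the table name dbo.FactSales is a made-up placeholder); in ADW new tables get columnstore storage by default, so the explicit index is only needed on the other options:

    -- Hedged sketch: convert a (hypothetical) fact table to clustered columnstore storage.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
        ON dbo.FactSales;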

400MB per day for a year totals ~143GB (400MB x 365 days ≈ 146,000MB), which honestly is not that much in modern data warehouse terms, where volumes are normally measured in terabytes.

Where Azure Data Lake Analytics (ADLA) comes in is doing things you cannot do in ordinary SQL, like:

  • combine the power of C# with SQL for powerful queries - example here (see also the sketch after this list)
  • deal with unstructured files like images, XML or JSON - example here
  • use RegEx
  • scale out R processing - example here
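
As a rough illustration of the first and third points, here is a hedged U-SQL sketch (the input path, columns and expressions are made up) mixing C# expressions and a .NET Regex call into a SQL-like query:

    // Hedged sketch: C# expressions and .NET Regex inside a U-SQL query.
    @logs =
        EXTRACT Url string,
                UserAgent string
        FROM "/input/weblogs.tsv"
        USING Extractors.Tsv();

    @classified =
        SELECT Url,
               // C# ternary and string methods evaluated per row
               (UserAgent.ToLowerInvariant().Contains("mobile") ? "Mobile" : "Desktop") AS DeviceClass,
               // .NET regular expressions are available directly
               System.Text.RegularExpressions.Regex.IsMatch(UserAgent, "(?i)bot|crawler") AS IsBot
        FROM @logs;

    OUTPUT @classified
    TO "/output/classified.csv"
    USING Outputters.Csv();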

ADLA also offers federated queries: the ability to "query data where it lives", i.e. bring together structured data from your database and unstructured data from your lake.
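
Roughly, a federated query looks something like the sketch below; it assumes a data source (here called MyAzureSqlDb) has already been registered in the ADLA catalog with CREATE DATA SOURCE and a stored credential, and the table and column names are placeholders:

    // Hedged sketch of a U-SQL federated query against an already-registered
    // Azure SQL Database data source; the registration steps are omitted here.
    @customers =
        SELECT CustomerId,
               Name
        FROM EXTERNAL MyAzureSqlDb LOCATION "dbo.Customers";

    OUTPUT @customers
    TO "/output/customers.csv"
    USING Outputters.Csv();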

Your decision seems to be more about whether or not you should be using the cloud. If you need the elastic and scalable features of the cloud, then Azure Data Factory is the tool for moving data from place to place in the cloud.

HTH
