Multiple data flows vs. all transformations in one data flow

Question

Hi, I am new to Azure Data Factory and not at all familiar with the back-end processing that runs behind the scenes. I am wondering whether running a couple of data flows in parallel has a performance impact compared with having all the transformations in one data flow.

I am trying to stage some data with a not-exists transformation, and I have to do it for multiple tables. When I test-ran two data flows in parallel, the clusters for both data flows were brought up simultaneously. But I am not sure which is the better approach: distributing the loading of the tables across a couple of data flows, or having all the transformations in one data flow.

Recommended answer

1: If you execute data flows in a pipeline in parallel, ADF will spin up a separate Spark cluster for each one, based on the settings in the Azure Integration Runtime attached to each activity.

2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.

3: Another option is to execute the activities serially in the pipeline. If you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution.
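As a rough illustration of options 1 and 3, the sketch below builds the pipeline JSON for two Execute Data Flow activities, once without dependencies (parallel) and once chained with `dependsOn` (serial). The data flow names `StageCustomers` and `StageOrders`, the pipeline names, and the helper functions are hypothetical, and only a minimal subset of the activity properties is shown; treat it as a sketch rather than a complete pipeline definition.

```python
import json


def execute_data_flow_activity(name, data_flow, depends_on=None):
    """Build a minimal Execute Data Flow activity definition."""
    return {
        "name": name,
        "type": "ExecuteDataFlow",
        # An empty dependsOn list lets the activity start immediately.
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in (depends_on or [])
        ],
        "typeProperties": {
            "dataflow": {"referenceName": data_flow, "type": "DataFlowReference"},
        },
    }


def build_pipeline(name, data_flows, serial):
    """Chain the activities when serial=True, otherwise leave them independent."""
    activities = []
    previous = None
    for data_flow in data_flows:
        activity_name = f"Run_{data_flow}"
        deps = [previous] if serial and previous else []
        activities.append(execute_data_flow_activity(activity_name, data_flow, deps))
        previous = activity_name
    return {"name": name, "properties": {"activities": activities}}


# Hypothetical data flow names, one per staged table.
tables = ["StageCustomers", "StageOrders"]
parallel_pipeline = build_pipeline("StageParallel", tables, serial=False)
serial_pipeline = build_pipeline("StageSerial", tables, serial=True)

print(json.dumps(serial_pipeline, indent=2))
```

The TTL mentioned in option 3 is configured on the Azure IR itself rather than in the pipeline, so it does not appear in this JSON.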

All are valid practices and which one you choose should be driven by your requirements for your ETL process.

No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.

No. 2 could be more difficult to follow logically and doesn't give you much re-usability.

No. 1 is really similar to #3, but you run them all in parallel. Of course, not every end-to-end process can run in parallel. You may require a data flow to finish before starting the next, in which case you're back in #3 serial mode.
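When only some data flows depend on others, one way to see how much can run in parallel is to group them into "waves" by dependency. The sketch below uses Python's standard `graphlib` module for this; the data flow names and the dependency map are hypothetical, and it only plans the ordering, it does not call ADF.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group data flows into waves; everything in one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # All nodes whose prerequisites have finished are ready together.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical example: each key maps to the data flows it must wait for.
dependencies = {
    "StageCustomers": set(),
    "StageProducts": set(),
    "StageOrders": {"StageCustomers", "StageProducts"},
}

waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the first wave holds the two independent data flows, which fit the parallel pattern of No. 1, while `StageOrders` has to wait for both and so falls back to the serial pattern of No. 3.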
