在ETL中进行分阶段:最佳实践? [英] Staging in ETL: Best Practices?

查看:95
本文介绍了在ETL中进行分阶段:最佳实践?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当前,我使用的体系结构采用了一些数据源,因为其中的数据源托管在云中,因此这些数据源是在本地暂存的。无论如何,其他都在本地托管,因此我执行的ETL直接从源头获取。我看不出为其他来源创建舞台的意义。

Currently, the architecture I work with takes a few data sources out of which one is staged locally because it's hosted in the cloud. The others are hosted locally anyway, so the ETL I perform takes it directly from the source. I don't really see the point in creating a stage for the other sources.

1)将本地托管的源复制到本地阶段是否有明显的好处?

1) Is there a distinct benefit to duplicating the locally hosted source into a local stage?

2)将舞台托管在单独的计算机上还是与Warehouse托管在同一台计算机上是个更好的主意?

2) Is it a better idea to host the stage on a separate machine or the same one as the Warehouse?

3)如果我想减少ETL时间,那是什么?这样做的好方法?我正在考虑对数据进行分区,以便比存档数据更频繁地提取重要信息。这是一个好方法吗,我有什么选择?

3) If I'm trying to reduce my ETL time, what's a good way to do so? I was considering partitioning my data so that the important information is pulled more frequently than the "archived data". Is this a good approach, and what are my alternatives?

推荐答案

@omgitsdev我想澄清一些概念。

@omgitsdev There are a few concepts I would like to clarify.

您的文件可以托管在任何地方-本地或云中
文件被加载到一个临时表中,以加载到您的数据仓库中。这个过程称为暂存。

Your files can be hosted anywhere - locally or on cloud The files are loaded into a temporary table to be loaded into your Data Warehouse. This process is called staging.

从概念上讲,您可以将暂存区放在任何地方;但是,为了减少连接性问题,我们在同一数据库中创建了一个单独的架构并进行了阶段化。

Conceptually you can have your staging area anywhere; however to reduce connectivity issues, we create a separate schema in the same database and stage them. This will ensure that your performance is not hampered by connectivity issues.

通常,您将事实表按保存日期的列进行分区;

you generally partition your fact table by the column which holds the date; this is easier and also the most recent partitions hold the latest data;

基于卷,您可以将其设为每月,每季度或每年的分区;在某些情况下,我们还会创建每日或每小时分区。

Based on the volume, you either make it a monthly, quarterly or yearly partition; there are situations where we also create daily or hourly partitions.

通过确保临时表与数据仓库表位于单独的磁盘中,还可以提高性能。

Your performance can also be accelerated by ensuring that the staging tables are in a separate disk from the data warehouse tables.

这篇关于在ETL中进行分阶段:最佳实践?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆