ETL操作-返回主键 [英] ETL Operation - Return Primary Key

查看:114
本文介绍了ETL操作-返回主键的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Talend填充数据仓库。我的工作是将客户数据写入维度表,将交易数据写入事实表。事实表上的代理键(p_key)是自动递增的。插入新客户时,我需要我的事实表以反映相关客户的ID。



正如我提到的,我的p_key是自动auto_incrementing,所以我不能只是为p_key插入任意值。



是否曾考虑过如何在维度表中插入一行并仍然检索要在我的事实记录中引用的主键?



更多信息:



如果输入数据未规范化怎么办?例如,我有一个csv,其中包含以下数据:

  order#日期总客户#first_name last_name 
111 1 / 2/2010 500101 John Smith
222 1/3/2010 600101 John Smith

显然,我希望客户信息显示在维度表中,而交易数据显示在事实表中:

 维度
101约翰史密斯

事实
111 1/3/2010
222 1/3/2010

正如您提到的,尺寸表的键将自动递增。事实表需要引用此密钥。您如何设计etl作业,以便在插入后返回代理键?



此外,如果对客户数据进行了重复数据删除(如上所述),您将如何处理

解决方案

我可能误解了您的问题,但是:


  1. 事实表可能具有也可能没有自动递增的PK,通常事实表中的PK是由多个引用维表的FK组成的。


  2. 维度表应具有自动递增的PK。


  3. 新客户应进入客户交易事实到达DW之前的维度表(或至少是事实表)。


  4. 维度表应具有唯一标识客户的BusinessKey-例如电子邮件,全名+个人识别码或类似名称。


  5. 传入交易行也应具有客户BusinessKey字段-这就是我们识别客户的方式。 / p>


  6. 使用在将交易插入事实表之前,需要从客户维度表中查找客户PrimaryKey的BusinessKey。


编辑



如果您的新客户数据与交易捆绑在一起,请找到一种方法来提取客户数据,并在交易之前将其路由到DW。



更新:



首先加载dimCustomer,决定使用BusinessKey,这样维将如下所示:

  CustomerKey = 12345(自动递增)
CustomerBusinessKey = john_smith_101(必须唯一地标识John Smith)
CustomerFirstName = John
CustomerLastName = Smith

在维度加载过程中,您必须将传入的行分隔为两个流,即现有和新客户。来自现有客户流的行将更新暗表(类型1 SCD),同时将来自新客户流的行插入。插入的行流中不应有重复项;您可以通过将它们插入到临时表中并在最终插入尺寸表之前删除重复项来实现。您还可以提取重复项并将其路由回加载过程以更新客户记录;它们可能包含较新的数据-例如更新的电话号码或类似数据。



客户进入后,请加载事实。



事实表应类似于:

  DateKey(PK )
CustomerKey
OrderNumber(PK)
总计

我有使用DateKey和OrderNumber的复合主键,允许不时重置订单号序列。



在加载过程中,修改事实记录以使其看起来像例如:

  DateKey CustomerBusinessKey OrderNumber总计
20100201 john_smith_101111500
20100301 john_smith_101 222600

这时,我们需要使用维度表中的CustomerKey将CustomerBusinessKey替换为CustomerKey。因此,在查找之后,流看起来像:

  DateKey CustomerKey OrderNumber总计
20100201 12345 111 500
20100301 12345 222 600

现在可以将其插入事实表。



我也作了些欺骗-没有从dimDate查找日期键,也没有在事实表中查找现有的行。加载事实表时,可以在加载之前查找现有的(DateKey,OrderNumer)组合,也可以将其留给主键来保护重复项-您可以选择。无论如何,请确保将相同数据重新加载到事实表的尝试失败。


I am using Talend to populate a data warehouse. My job is writing customer data to a dimension table and transaction data to the fact table. The surrogate key (p_key) on the fact table is auto-incrementing. When I insert a new customer, I need my fact table to reflect the id of the related customer.

As I mentioned my p_key is auto auto_incrementing so I can't just insert an arbitrary value for the p_key.

Any thought on how I can insert a row into my dimension table and still retrieve the primary key to reference in my fact record?

More info:

What if the incoming data isn't normalized? For instance I have a csv with the following data:

order #   date        total customer# first_name last_name
111       1/2/2010    500    101      John        Smith     
222       1/3/2010    600    101      John        Smith

Obviously, I want the customer info to appear in the dimension table, and the transaction data in the fact table:

dimension
101  john smith

fact
111       1/3/2010
222       1/3/2010

As you mentioned, the key of the dimension table will be auto incrementing. The fact table needs to reference this key. How do you design the etl job so that the surrogate key is returned after an insert to ?

Also, if the customer data is deduped (as above) how do you handle the keys?

解决方案

I may have misunderstood you problem, however:

  1. A fact table may or may not have an auto-incrementing PK, usually a PK in a fact table is a composite of several FKs referencing dimension tables.

  2. A dimension table should have an auto-incrementing PK.

  3. A new customer should "land" into the customer dimension table before the transaction fact reaches the DW (or at least the fact table).

  4. A dimension table should have a BusinessKey which uniquely identifies a customer -- like email, full name + pin, or similar.

  5. An incoming transaction row should have the customer BusinessKey field too -- that's how we identify the customer.

  6. Use the BusinessKey to lookup the customer PrimaryKey from the customer dimension table before inserting the transaction into the fact table.

EDIT

If your new customer data is bundled with the transaction, find a way to extract customer data and route it to the DW ahead of the transaction.

UPDATE:

Load dimCustomer first, decide on BusinessKey -- so the dimension would look like:

CustomerKey = 12345 (auto-incremented)
CustomerBusinessKey = john_smith_101 (must uniquely identify the John Smith)
CustomerFirstName = John
CustomerLastName = Smith

During dimension loading process, you have to segregate incoming rows int two streams, existing and new customers. Rows from the "existing customer" stream update the dim table (type 1 SCD), while rows from the "new customer" stream are inserted. There should be no duplicates in the stream of rows that are being inserted; you can accomplish this by inserting them into a staging table and removing duplicates there, just before the final insert into the dimension table. You can also extract duplicates and route them back into the loading process to update customer records; they may contain newer data -- like updated phone numbers or similar.

Once the customer is in, load facts.

The fact table should look something like:

DateKey     (PK)
CustomerKey
OrderNumber (PK)
Total

I have used composite primary key of the DateKey and the OrderNumber, allowing for the order number sequence to reset from time to time.

During loading process, modify the fact record to look something like:

DateKey CustomerBusinessKey OrderNumber Total
20100201  john_smith_101       111       500
20100301  john_smith_101       222       600

At this point we need to replace the CustomerBusinessKey with the CustomerKey from the dimension table using a lookup. So, after the lookup the stream would look like:

DateKey CustomerKey OrderNumber Total
20100201 12345       111         500
20100301 12345       222         600

This can now be inserted into the fact table.

I have also cheated a bit -- did not lookup a date key from the dimDate, and did not look for existing rows in the fact table. When loading the fact table, you can look for existing (DateKey, OrderNumer) combination before loading, or you can leave it up to the primary key to protect agains duplicates -- your choice. In any case make sure that an attempt to re-load the same data into the fact table fails.

这篇关于ETL操作-返回主键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆