Managing surrogate keys in a data warehouse

Problem description

I want to build a data warehouse, and I want to use surrogate keys as primary keys for my fact tables. But the problem is that in my case fact tables should be updated.

The first question is: how do I find the auto-generated surrogate key that corresponds to a natural key in the source system? I have seen some answers mentioning lookup tables that store the correspondence between natural and surrogate keys, but I didn't understand how exactly they are implemented. Where should this table be stored: in the data warehouse itself, or somewhere else?

There is also a second question. The source system already contains surrogate keys for facts, but they are of the UUID data type, which takes 16 bytes, while the number of facts is very unlikely to exceed the maximum value of a 4-byte integer. Should I use the UUIDs provided by the source system to simplify ETL, or should I do more complex ETL and implement my own integer surrogate keys for better performance?

Recommended answer

I think the rest is answered already. I'd give you my 2 cents about managing and maintaining surrogate keys.

I worked with surrogate keys a lot during my time at Teradata. Here are a few best practices I learned over the years about surrogate keys.

  1. Generate surrogate keys only from an approved master source (in your case, a particular API). Not many APIs should be allowed to generate the same domain values, so pick the one that acts as master for each domain. For example, the customer number usually comes from the CRM system as master, not from the billing system.
  2. Generate and store the keys in a lookup table, as you suspected (let's call it Customer_SGK). Generally these surrogate key tables are not part of your final LDM/PDM in either the Inmon or the Kimball approach. They reside on the same database server, but in a technical metadata schema. Let's call that schema "My_Tec_Schema".
  3. Such a lookup table has the surrogate key column (e.g. Customer_ID), one natural key column per master source (source1_customerNO, source2_customerNO), and a timestamp to keep a trail of when the key was generated.
  4. Your PK is Customer_ID, which may not be unique in this table, so depending on the data storage technology used you may have to declare it as a unique or non-unique primary index/key (in Teradata, for instance, it would be a NUPI).
  5. You sometimes have to allow this non-uniqueness to ease your ETL processes: two different natural keys coming from two different source systems are loaded with the same Customer_ID because they both refer to the same customer.
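
The lookup table described in the list above could look roughly like the following. This is only a minimal sketch in Python with SQLite, not the Teradata setup the answer has in mind: SQLite has no schemas, so an attached in-memory database stands in for "My_Tec_Schema", a plain non-unique index stands in for a NUPI, and the `created_at` column name, the `lookup` helper and the sample key values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no CREATE SCHEMA; an attached database stands in for the
# technical metadata schema here (an assumption of this sketch).
conn.execute("ATTACH DATABASE ':memory:' AS My_Tec_Schema")
conn.execute("""
    CREATE TABLE My_Tec_Schema.Customer_SGK (
        Customer_ID        INTEGER NOT NULL,  -- surrogate key, deliberately not UNIQUE
        source1_customerNO TEXT,              -- natural key from source 1
        source2_customerNO TEXT,              -- natural key from source 2
        created_at         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- key generation trail
    )
""")
# Non-unique index on the surrogate key, loosely playing the role of a NUPI.
conn.execute(
    "CREATE INDEX My_Tec_Schema.ix_customer_sgk_id ON Customer_SGK (Customer_ID)"
)

NATURAL_KEY_COLUMNS = {"source1_customerNO", "source2_customerNO"}


def lookup(conn, column, natural_key):
    """Return the surrogate key for a natural key, or None if it is unknown."""
    if column not in NATURAL_KEY_COLUMNS:
        raise ValueError(f"unknown natural key column: {column}")
    row = conn.execute(
        f"SELECT Customer_ID FROM My_Tec_Schema.Customer_SGK WHERE {column} = ?",
        (natural_key,),
    ).fetchone()
    return row[0] if row else None


# Two natural keys from two source systems that mean the same customer
# share one surrogate key (hypothetical sample values).
conn.executemany(
    "INSERT INTO My_Tec_Schema.Customer_SGK "
    "(Customer_ID, source1_customerNO, source2_customerNO) VALUES (?, ?, ?)",
    [(1, "CRM-001", None), (1, None, "BILL-9")],
)

crm_key = lookup(conn, "source1_customerNO", "CRM-001")
billing_key = lookup(conn, "source2_customerNO", "BILL-9")
```

Both lookups resolve to the same Customer_ID, which is exactly the case points 4 and 5 allow for.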

With this lookup table in place, load it (that is, generate the keys) from your stage tables or sources as the very first step of your ETL process. Then load the fact table from stage with a left outer join to the lookup table to pick up the surrogate key, and ideally store the natural keys in the fact table as well. You always want to have them there, because most of the time you will end up with some orphans in your fact tables. The only fast and reliable way to recover from that is to have the natural keys in the fact table and to run an update that goes through the PK, PI or an index, which is very quick compared with full table scans.
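
The load order described above can be sketched as follows, again in Python with SQLite rather than a real warehouse. The `stage_sales` and `fact_sales` tables, their columns, the three function names and the sample rows are all hypothetical, and for brevity the sketch uses a single source system and lets SQLite auto-generate Customer_ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_SGK (
        Customer_ID        INTEGER PRIMARY KEY,  -- auto-generated surrogate key
        source1_customerNO TEXT UNIQUE,          -- natural key from the master source
        created_at         TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE stage_sales (customerNO TEXT, amount REAL);
    CREATE TABLE fact_sales (Customer_ID INTEGER, customerNO TEXT, amount REAL);
    -- Index on the natural key so orphan repair avoids a full table scan.
    CREATE INDEX ix_fact_sales_customerNO ON fact_sales (customerNO);
""")


def generate_keys(conn):
    """Step 1: create surrogate keys for natural keys not seen before."""
    conn.execute("""
        INSERT INTO Customer_SGK (source1_customerNO)
        SELECT DISTINCT s.customerNO
        FROM stage_sales AS s
        LEFT OUTER JOIN Customer_SGK AS k ON k.source1_customerNO = s.customerNO
        WHERE k.Customer_ID IS NULL AND s.customerNO IS NOT NULL
    """)


def load_facts(conn):
    """Step 2: load facts with the surrogate key and keep the natural key."""
    conn.execute("""
        INSERT INTO fact_sales (Customer_ID, customerNO, amount)
        SELECT k.Customer_ID, s.customerNO, s.amount
        FROM stage_sales AS s
        LEFT OUTER JOIN Customer_SGK AS k ON k.source1_customerNO = s.customerNO
    """)


def repair_orphans(conn):
    """Fill in missing surrogate keys using the natural key kept on the fact."""
    conn.execute("""
        UPDATE fact_sales
        SET Customer_ID = (SELECT k.Customer_ID FROM Customer_SGK AS k
                           WHERE k.source1_customerNO = fact_sales.customerNO)
        WHERE Customer_ID IS NULL
    """)


conn.executemany(
    "INSERT INTO stage_sales VALUES (?, ?)",
    [("C-1", 10.0), ("C-2", 5.0), ("C-1", 2.5)],
)
generate_keys(conn)
load_facts(conn)

# Simulate an orphan: a fact that was loaded without its surrogate key.
conn.execute("INSERT INTO fact_sales VALUES (NULL, 'C-2', 7.0)")
repair_orphans(conn)

facts = conn.execute(
    "SELECT Customer_ID, customerNO, amount FROM fact_sales ORDER BY rowid"
).fetchall()
```

Because the natural key travels with every fact row, the repair step is a keyed update rather than a reload of the fact table.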

I could go on and on about surrogate keys. Feel free to ask any specific question once you have read this high-level overview. I'd be glad to help.
