在Apache Hive的维度表中使用代理键或nk + effective_time是否更好 [英] Is it better to have a surrogate key or nk+effective_time in dimension tables in apache hive

查看:302
本文介绍了在Apache Hive的维度表中使用代理键或nk + effective_time是否更好的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

可以说,这里有一个SCD2尺寸表-位置。自然键是国家,州和城市的总和。由于它是SCD2表,因此eff日期也是键的一部分。

Lets say, there is a SCD2 dimension table - location. The natural key is country, state and city combined. Since it is SCD2 table, eff date is also part of the key.

最好将替代键设置为usavirginarichmond20110101或使用row_number()创建实际的数字键

Is it better to have the surrogate key as usavirginarichmond20110101 or create an actual numerical key using row_number() in hive?

为什么一种方法优于另一种方法?

Why one approach is better over another?

推荐答案

(术语说明:自然键的组合称为复合键,而不是替代键,它仍然是自然键。替代键(又称合成键)是无业务意义的连续整数)。

(Note on terminology: combination of natural keys is called "composite key", not surrogate key, and it's still a "natural key". Surrogate key (aka Synthetic key) is a sequential integer that has no business meaning).

简短答案:由于您的尺寸为SCD2,因此请务必使用代理键/合成键。使用自然/复合键处理SCD是很痛苦的。

Short answer: since your dimension is SCD2, definitely use surrogate/synthetic keys. Handling SCD with natural/composite keys is a pain.

更长的答案:
代理(SK)与自然键(NK)设计是一个持续不断的争论。每个都有优点和缺点。我的方法是始终使用数据仓库(DW)中的代理密钥。这意味着需要做一些额外的ETL工作,但这是可以接受的成本,因为代理键具有一些重要的优点:

Longer answer: Surrogate (SK) vs Natural keys (NK) design is an on-going debate. Each has pros and cons. My approach is to always use surrogate keys in data warehouse (DW). It means some extra ETL work, but that's an acceptable cost because surrogate keys have some important advantages:


  1. SCD处理要容易得多。如果您有SCD,则使用自然键相当麻烦且难看。合成键没有问题;

  1. SCD handling is much easier. If you have SCDs, using natural keys is rather cumbersome and ugly. Synthetic keys don't have the problem;

系统范围的一致性:由于SCD,您很有可能必须在数据仓库中使用SK至少在某些表中。然后在所有表中始终使用它们是有意义的。混合使用SK和NK设计很丑陋;

System-wide consistency: because of SCD, it's highly likely that you will have to use SKs in your Data Warehouse at least in some tables. It makes sense then to consistently use them in all tables. Mixing SK and NK designs is ugly;

复合NK通常可以是大而复杂的字母数字字符串。这意味着它们可能会大大增加表的大小,而连接可能会变慢。 SK是一个简单的整数,具有可预测的大小和一致的连接速度;

Composite NKs can often be large and complex alpha-numeric strings. It means that they might substantially increase table sizes, and joins might be slower. SK is a simple integer, with predictable size and consistent join speed;

NK可能是DW中错误和不稳定的来源。例如,某些数据库重复使用其自然键,结果其含义可能会随时间而改变。在依赖NK的DW中,这是潜在的灾难。同样,NK可能来自多种来源,并导致集成冲突。

NKs can be a source of bugs and instability in DW. For example, some databases re-use their natural keys, and as a result their meaning might change over time. In DW that relies on NKs that's a potential disaster. Also, NKs might come from a wide variety of sources, and lead to integration conflicts.

还有其他注意事项,但以我的经验,系统地使用代理密钥使DW设计更加可靠和高效。

There are other considerations, but in my experience, systematically using surrogate keys makes DW design more reliable and efficient.

这篇关于在Apache Hive的维度表中使用代理键或nk + effective_time是否更好的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆