What is combining repeating sets of row information into new entities called when doing database normalization?


Question

I'm a bit confused about a certain piece of database normalization and thought I'd ask StackOverflow:

Imagine you have the following relations that relate products to colors. Notice that Product 1 and Product 2 both use the same set of colors (Blue and Green).

Product_Color                         Color
+-------------+-------------+     +-------------+-------------+
| Product*    | Color*      |     | ColorId*    | Name        |
+-------------+-------------+     +-------------+-------------+
| 1           | 1           |     | 1           | Blue        |
| 1           | 2           |     | 2           | Green       |
| 2           | 1           |     +-------------+-------------+
| 2           | 2           |
+-------------+-------------+

If I create two new relations, ColorSet and ColorSet_Color, I can display the same information by joining the 4 relations together.

Product_ColorSet:                 ColorSet_Color:
+-------------+-------------+     +-------------+-------------+
| Product*    | ColorSetId* |     | ColorSetId* | ColorId*    |
+-------------+-------------+     +-------------+-------------+
| 1           | 1           |     | 1           | 1           |
| 2           | 1           |     | 1           | 2           |
+-------------+-------------+     +-------------+-------------+

ColorSet:                         Color:
+-------------+                   +-------------+-------------+
| ColorSetId* |                   | ColorId*    | Name        |
+-------------+                   +-------------+-------------+
| 1           |                   | 1           | Blue        |
| 2           |                   | 2           | Green       |
+-------------+                   +-------------+-------------+

At this point, if I had a large Product_Color table with a reasonable degree of shared groups of colors, I would stand to gain considerably from a space perspective.
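The claim that the four relations carry the same information as the original two can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in DBMS (an assumption; the question names no particular system), and the column types and key declarations are guesses based on the asterisks in the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Original design (types and constraints are assumptions)
    CREATE TABLE Color (ColorId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Product_Color (
        Product INTEGER NOT NULL,
        Color   INTEGER NOT NULL REFERENCES Color(ColorId),
        PRIMARY KEY (Product, Color)
    );
    INSERT INTO Color VALUES (1, 'Blue'), (2, 'Green');
    INSERT INTO Product_Color VALUES (1, 1), (1, 2), (2, 1), (2, 2);

    -- Second design, with ColorSetId naming each set of colors
    CREATE TABLE ColorSet (ColorSetId INTEGER PRIMARY KEY);
    CREATE TABLE ColorSet_Color (
        ColorSetId INTEGER NOT NULL REFERENCES ColorSet(ColorSetId),
        ColorId    INTEGER NOT NULL REFERENCES Color(ColorId),
        PRIMARY KEY (ColorSetId, ColorId)
    );
    CREATE TABLE Product_ColorSet (
        Product    INTEGER NOT NULL,
        ColorSetId INTEGER NOT NULL REFERENCES ColorSet(ColorSetId),
        PRIMARY KEY (Product, ColorSetId)
    );
    INSERT INTO ColorSet VALUES (1), (2);
    INSERT INTO ColorSet_Color VALUES (1, 1), (1, 2);
    INSERT INTO Product_ColorSet VALUES (1, 1), (2, 1);
""")

# Product/color-name pairs from the original two relations.
original = conn.execute("""
    SELECT pc.Product, c.Name
    FROM Product_Color pc
    JOIN Color c ON c.ColorId = pc.Color
    ORDER BY pc.Product, c.Name
""").fetchall()

# The same pairs, recovered by joining all four relations of the second design.
via_color_set = conn.execute("""
    SELECT pcs.Product, c.Name
    FROM Product_ColorSet pcs
    JOIN ColorSet cs        ON cs.ColorSetId  = pcs.ColorSetId
    JOIN ColorSet_Color csc ON csc.ColorSetId = cs.ColorSetId
    JOIN Color c            ON c.ColorId      = csc.ColorId
    ORDER BY pcs.Product, c.Name
""").fetchall()

print(original == via_color_set)
```

Note that the second design stores one row per product plus one row per color in each set, so the row count only drops when many products share sets.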

What is the technical name for this operation in the context of database normalization? I'm clearly removing redundant information even though the entity I've created doesn't actually exist; it's more just random chance that there is a lot of overlap. What specifically am I changing by doing this?

Furthermore, it seems like I could arbitrarily do this to most entities. What puzzles me is that Product_Color and Color were already in 6th normal form when we started the exercise (right?).

Recommended answer

You are introducing a "surrogate key" (or identifier) to name/identify the sets of colours that products come in. The alternative is usually considered to be a "natural key" (or identifier). (Different people use these terms differently in detail, though. E.g. some might only use "surrogate" when a name/identifier is assigned to a referent permanently and/or is its referent's only name/identifier and/or is visible only within the database and not the application. E.g. some would say that an externally visible, system-generated arbitrary name/identifier like a Driver Identification Number is both a surrogate and natural.)

Surrogate keys are often called "meaningless (identifiers)". This reflects muddled thinking. All names not generated by an a priori naming scheme are "meaningless" and arbitrary. "Nicholas" did not "mean" you until it was chosen; having been chosen, it "means" you. This goes for any name/identifier. So "meaningless"/"meaningful" is not a helpful distinction. A surrogate name/identifier in a system is just one that got chosen after the system started. What gets called "meaningful" [sic] in a system would have been called "meaningless" [sic] when assigned in whatever system existed before (since assignment was after it started).

There is a "perspective" in which you are "removing redundant information", but it's not the kind of redundancy that normalization addresses. You are replacing a table by other tables, but it's not normalization decomposition. Introduction of surrogates is not part of normalization. Normalization does not introduce new column names. It just reuses an original table's names in the tables that replace it. (Are you able to clearly and exactly describe just what you mean by "redundant" here?)

Sometimes people think that if the same subtuple of values can appear more than once in a column set or table, then those subrow values need to be replaced by ids that are FKs to a new table that maps id values to subrow values. (Maybe even for single-column subrows, i.e. when a single value appears more than once in a column or table.) They think that multiple subrow value appearances are "redundant" or that only ids can repeat without being "redundant". (The id design is seen as a kind of data compression of the original.) They may think that this is part of normalization. None of this is so.

This is not redundancy that you should bother to address via table design. If you know the implementation options the DBMS offers for your tables, and you know the usage patterns of your application, and you know that the original is demonstrably and meaningfully worse than some option that happens to be "less redundant" (and why wouldn't a "more redundant" option be better?), then you should tell the DBMS what option you want for your design, without changing the schema if you can. (This is typically done via indexes and/or views.) E.g. indexing your original Product_Color on ColorId leads to essentially the same structure in the implementation as you have created by hand in your second design, but automatically generated and managed. (You might introduce surrogates for other reasons, e.g. to replace multiple-column foreign keys by more concise, although more obscurely valued and constrained, ones.)
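As a rough illustration of asking the DBMS for an implementation option instead of changing the schema, here is a minimal sketch using Python's sqlite3 module. SQLite is only a stand-in (an assumption), the index name idx_product_color_by_color is hypothetical, and the color column is called Color as in the question's original table. How closely the resulting structure matches the hand-built design depends on the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product_Color (
        Product INTEGER NOT NULL,
        Color   INTEGER NOT NULL,
        PRIMARY KEY (Product, Color)
    );
    INSERT INTO Product_Color VALUES (1, 1), (1, 2), (2, 1), (2, 2);

    -- The schema stays as it was; only an access path is added.
    -- The index name is hypothetical.
    CREATE INDEX idx_product_color_by_color ON Product_Color (Color);
""")

query = "SELECT Product FROM Product_Color WHERE Color = ?"

# Ask SQLite how it intends to run the lookup by color.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last column holds the description
print(plan_text)

# The query text is the same with or without the index.
products = sorted(row[0] for row in conn.execute(query, (1,)))
print(products)
```

The point of the sketch is that the index is created and maintained by the DBMS, and queries against Product_Color do not need to be rewritten to benefit from it.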

Re options: Your new design will use more operations (e.g. joins and projections) in query text and (for typical DBMS implementations) execution than the original (e.g. to query for the original table) but fewer elsewhere (e.g. in copying one product's colour set to another's). So again it is all about tradeoffs between multiple "perspectives".
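The tradeoff can be made concrete with a small sketch of the "copy one product's colour set to another" case, again using Python's sqlite3 module as an assumed stand-in DBMS. Product 3 is a hypothetical new product, and the `count` helper exists only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product_Color (
        Product INTEGER, Color INTEGER, PRIMARY KEY (Product, Color));
    INSERT INTO Product_Color VALUES (1, 1), (1, 2), (2, 1), (2, 2);

    CREATE TABLE Product_ColorSet (
        Product INTEGER, ColorSetId INTEGER, PRIMARY KEY (Product, ColorSetId));
    CREATE TABLE ColorSet_Color (
        ColorSetId INTEGER, ColorId INTEGER, PRIMARY KEY (ColorSetId, ColorId));
    INSERT INTO Product_ColorSet VALUES (1, 1), (2, 1);
    INSERT INTO ColorSet_Color VALUES (1, 1), (1, 2);
""")

def count(table):
    """Number of rows in a table (helper for this illustration only)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Original design: give product 3 the colours of product 1, one row per colour.
before = count("Product_Color")
conn.execute("""
    INSERT INTO Product_Color (Product, Color)
    SELECT 3, Color FROM Product_Color WHERE Product = 1
""")
rows_added_original = count("Product_Color") - before

# Surrogate design: point product 3 at the same colour set, a single row.
before = count("Product_ColorSet")
conn.execute("""
    INSERT INTO Product_ColorSet (Product, ColorSetId)
    SELECT 3, ColorSetId FROM Product_ColorSet WHERE Product = 1
""")
rows_added_surrogate = count("Product_ColorSet") - before

# Reading the colours back takes no join in the original, one join here.
colours_original = sorted(r[0] for r in conn.execute(
    "SELECT Color FROM Product_Color WHERE Product = 3"))
colours_surrogate = sorted(r[0] for r in conn.execute("""
    SELECT csc.ColorId
    FROM Product_ColorSet pcs
    JOIN ColorSet_Color csc ON csc.ColorSetId = pcs.ColorSetId
    WHERE pcs.Product = 3
"""))
print(rows_added_original, rows_added_surrogate, colours_original == colours_surrogate)
```

Both copies are a single statement, but the original writes one row per colour while the surrogate design writes one row in total and pays for it with a join on every read.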

In fact, you have in another sense introduced redundancy with the surrogates. There are additional columns holding a bunch of id values that are not in the original, yet they record the same situations. You have also burdened the user with a design that has more naming and indirection. The surrogate design certainly has a lot of "redundant information" in this "perspective" compared to the original.

Even your starting design has probably introduced surrogates, namely colour ids for colour names. (If colour ids added "information", i.e. "informed" you beyond just their associated names, then they would not be surrogates and would be necessary.) I.e. if colour ids are chosen arbitrarily, then you could just have:

Product_Color
+-------------+-------------+
| Product*    | ColorName*  |
+-------------+-------------+
| 1           | Blue        |
| 1           | Green       |
| 2           | Blue        |
| 2           | Green       |
+-------------+-------------+
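A minimal sketch of this single-table design, using Python's sqlite3 module as an assumed stand-in DBMS (the column types and the composite key are guesses based on the asterisks above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The colour name itself is the natural key; there is no Color table.
    CREATE TABLE Product_Color (
        Product   INTEGER NOT NULL,
        ColorName TEXT    NOT NULL,
        PRIMARY KEY (Product, ColorName)
    );
    INSERT INTO Product_Color VALUES
        (1, 'Blue'), (1, 'Green'), (2, 'Blue'), (2, 'Green');
""")

# Listing a product's colours needs no join and no id lookup.
colours_of_product_1 = [row[0] for row in conn.execute(
    "SELECT ColorName FROM Product_Color WHERE Product = 1 ORDER BY ColorName")]
print(colours_of_product_1)
```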

You should have a reason to introduce colour ids, and for that matter product ids, rather than using natural keys that already exist. Can you justify your multiple tables, names and indirections vs just one?
