Tools for discovering de facto foreign keys in databases?
A good way to quickly survey the information in a database is to apply a tool that automatically creates a database diagram of all tables and all relationships between them.
In my experience, such tools use foreign keys as the relationships, and most of the databases I try them on do not declare any. Sure, the data satisfies the constraints that foreign keys would express, but the database does not enforce them. So I end up with a 'diagram' consisting of a bunch of unrelated tables.
So what I'm looking for is software that can compute "undeclared foreign keys" and either
- uses them as table relations in a database diagram, or
- generates SQL code for corresponding foreign key declarations
Do you know any tools, free if possible, that can already do this?
Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.
I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on):
- are these in-house enterprise systems that follow some consistent naming convention or pattern?
- or are they 'in-the-wild' databases that you come across anywhere, at any time?
- what sort of assumptions are you prepared to make?
- would you prefer to get more false positives or false negatives in your result?
Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.
So I offer two approaches that I'd use in concert.
Inferring a relationship through structure / naming (symbolic analysis)
Common database design is to name a PK column after the table name (e.g. CustomerId on table Customer), or alternatively name the PK column simply Id.
A table with a FK relationship to another often names its related column the same as the related table. In the Order table I'd expect a CustomerId column which refers to the CustomerId / Id column in the Customer table.
This type of analysis would include:
- inspecting columns across tables for similar phrases / words
- looking for column names that are similar to the names of other tables
- checking for column names that contain the name of another column (e.g. FirstCustomerId & SecondCustomerId both refer to the CustomerId column in the Customer table)
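A minimal Python sketch of the naming heuristic, purely for illustration. It assumes the schema has already been read into plain dictionaries, that FK-like columns end in "Id", and that the word just before "Id" names the parent table. The function names (split_words, infer_by_name, to_ddl) are hypothetical, not part of any existing tool. The generated ALTER TABLE text follows standard-SQL style; quoting rules and ALTER TABLE support vary by DBMS (SQLite, for example, cannot add a foreign key this way).

```python
import re

def split_words(name):
    """Split an identifier such as 'FirstCustomerId' or 'St_Id' into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def infer_by_name(schema, primary_keys):
    """Guess FK candidates from column names alone.

    schema: {table: [column, ...]}; primary_keys: {table: pk_column}.
    Returns (child_table, child_column, parent_table, parent_column) tuples.
    """
    tables_by_word = {t.lower(): t for t in schema}
    candidates = []
    for table, columns in schema.items():
        for column in columns:
            words = split_words(column)
            if len(words) < 2 or words[-1] != "id":
                continue
            # The word just before "Id" is taken as the suspected parent table,
            # so FirstCustomerId and SecondCustomerId both point at Customer.
            parent = tables_by_word.get(words[-2])
            if parent is None:
                continue
            # Skip a table's own primary key (Customer.CustomerId).
            if parent == table and column == primary_keys.get(table):
                continue
            candidates.append((table, column, parent, primary_keys[parent]))
    return candidates

def to_ddl(candidate):
    """Render one candidate as a standard-SQL-style FK declaration (dialects differ)."""
    child_table, child_col, parent_table, parent_col = candidate
    return (f'ALTER TABLE "{child_table}" ADD FOREIGN KEY ("{child_col}") '
            f'REFERENCES "{parent_table}" ("{parent_col}");')

# Made-up example schema.
schema = {
    "Customer": ["CustomerId", "Name"],
    "Order": ["OrderId", "CustomerId", "Total"],
    "Transfer": ["TransferId", "FirstCustomerId", "SecondCustomerId"],
}
primary_keys = {"Customer": "CustomerId", "Order": "OrderId", "Transfer": "TransferId"}

candidates = infer_by_name(schema, primary_keys)
for c in candidates:
    print(to_ddl(c))
```

Every candidate produced this way is only a guess from names; it should be checked against the data before any constraint is actually declared.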
Inferring a relationship through data (statistical analysis)
Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the CustomerId column in the Order table contains values which don't exist in the Id column of the Customer table, then it's reasonable to question whether this is a valid relationship (although you never know!).
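The containment check described above could be sketched like this in Python, using the standard-library sqlite3 module and an in-memory database with made-up rows. The containment function and the sample tables are assumptions for illustration only; a ratio of 1.0 means every distinct non-NULL child value exists in the parent column, and anything lower is a reason to doubt the relationship.

```python
import sqlite3

def quote(name):
    """Quote an identifier for SQLite (doubles any embedded double quotes)."""
    return '"' + name.replace('"', '""') + '"'

def containment(conn, child_table, child_col, parent_table, parent_col):
    """Fraction of distinct non-NULL child values that also occur in the parent column.

    Returns None when the child column holds no non-NULL values.
    """
    ct, cc = quote(child_table), quote(child_col)
    pt, pc = quote(parent_table), quote(parent_col)
    # COUNT(DISTINCT ...) ignores NULLs.
    total = conn.execute(f"SELECT COUNT(DISTINCT {cc}) FROM {ct}").fetchone()[0]
    if total == 0:
        return None
    matched = conn.execute(
        f"SELECT COUNT(DISTINCT c.{cc}) FROM {ct} AS c "
        f"JOIN {pt} AS p ON c.{cc} = p.{pc}"
    ).fetchone()[0]
    return matched / total

# Made-up data: customer 99 is referenced by an order but does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY);
    CREATE TABLE "Order" (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER);
    INSERT INTO Customer VALUES (1), (2), (3);
    INSERT INTO "Order" VALUES (10, 1), (11, 1), (12, 2), (13, 99), (14, NULL);
""")

ratio = containment(conn, "Order", "CustomerId", "Customer", "Id")
print(ratio)  # 2 of the 3 distinct values (1, 2, 99) exist in Customer
```

Whether to accept a candidate with a ratio slightly below 1.0 is a judgment call, since undeclared keys are exactly where orphaned rows tend to accumulate.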
A simple form of data analysis is using dates and times. Rows that were created in close proximity to one another are more likely to be related to one another. If, for every Order row that was created, there also exist between 2 and 5 Item rows created within a few seconds, then a relationship between the two is likely.
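As a rough illustration of the timestamp idea, the following Python sketch counts, for each parent row, how many child rows were created within a small window around it. It assumes creation times are already available as numbers (for example, epoch seconds); the function names and the sample timestamps are invented for this example.

```python
from bisect import bisect_left, bisect_right

def neighbours_within(parent_times, child_times, window_seconds):
    """For each parent timestamp, count child timestamps within +/- window_seconds."""
    child_sorted = sorted(child_times)
    counts = []
    for t in parent_times:
        lo = bisect_left(child_sorted, t - window_seconds)
        hi = bisect_right(child_sorted, t + window_seconds)
        counts.append(hi - lo)
    return counts

def looks_related(counts, low=2, high=5):
    """True when every parent row has between low and high nearby child rows."""
    return bool(counts) and all(low <= c <= high for c in counts)

# Made-up creation times in seconds.
order_times = [100.0, 500.0]
item_times = [100.5, 101.0, 102.0, 500.2, 501.0, 900.0]

counts = neighbours_within(order_times, item_times, window_seconds=3)
print(counts, looks_related(counts))
```

On a busy system, unrelated rows are created close together all the time, so this signal is probably best used to rank candidates found by other means rather than to propose them.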
A more detailed analysis might look at the range and distribution of used values.
For example, if your Order table has a St_Id column, you might infer using symbolic analysis that the column is likely to relate to either a State table or a Status table. The St_Id column has 6 discrete values, and 90% of the records are covered by 2 values. The State table has 200 rows, and the Status table has 9 rows. You could quite reasonably infer that the St_Id column relates to the Status table: it gives greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in the State table would be used).
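The coverage arithmetic in this example can be written down as a small Python sketch. It assumes the number of distinct values in the column and the row counts of the candidate parent tables have already been measured; rank_parents is a hypothetical helper, not an existing API. Parents with fewer rows than the column has distinct values are dropped, since a key column cannot supply that many different values.

```python
def rank_parents(distinct_values_used, parent_row_counts):
    """Rank candidate parent tables by the share of their rows the column would use.

    parent_row_counts: {table: row_count}. Returns (table, coverage) pairs,
    highest coverage first.
    """
    usable = {
        table: rows
        for table, rows in parent_row_counts.items()
        if rows >= distinct_values_used  # too-small parents cannot be the target
    }
    ranked = sorted(usable.items(), key=lambda item: item[1])
    return [(table, distinct_values_used / rows) for table, rows in ranked]

# Numbers from the St_Id example: 6 distinct values, State has 200 rows, Status has 9.
ranking = rank_parents(6, {"State": 200, "Status": 9})
for table, cov in ranking:
    print(f"{table}: {cov:.0%} of rows used")
```

Here Status comes out ahead with 6/9 (about 67%) of its rows used, against 6/200 (3%) for State, matching the reasoning above.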
If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns to emerge that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely that this column relates to a table with a correspondingly small number of rows.
In summary
Best of luck. It's an interesting problem. I've just thrown some ideas out there, but this is very much a trial-and-error, data-gathering and performance-tuning situation.