How do I know which is the more appropriate database design? (Authors, Articles & Comments)

Question

Let's assume a database with three tables: Authors, Articles, and Comments.

The relationships are assumed to be as follows:

Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article

If I want to know which Author writes the most commented Articles, I first need to select all Articles that belong to a specific Author, and then count the number of Comments posted under each of those Articles. In general this leads to more complex queries.
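For illustration only, here is a minimal sketch of the normalized design and of the join the question describes, using Python's built-in sqlite3 module; the table and column names (authors, articles, comments, author_id, article_id) are assumptions, not taken from the question.

```python
import sqlite3

# Hypothetical normalized schema; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           author_id INTEGER REFERENCES authors(id),
                           title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER REFERENCES articles(id),
                           body TEXT);
""")

# "Which author's articles attract the most comments?" needs two joins
# (comments -> articles -> authors) before the count can be grouped by author.
query = """
    SELECT a.name, COUNT(c.id) AS comment_count
    FROM authors a
    JOIN articles ar ON ar.author_id = a.id
    LEFT JOIN comments c ON c.article_id = ar.id
    GROUP BY a.id
    ORDER BY comment_count DESC
    LIMIT 1
"""
print(conn.execute(query).fetchall())
```

With an empty database this prints an empty list; the point is only the shape of the query.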

If instead the relationships are:

Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article
**Comment belongs to one Author of the relevant Article**

Then I could directly select and count all Comments that were posted under the Articles of a specific Author, without having to include the Articles in the query.
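A sketch of this second variant under the same assumptions: the comments table carries a redundant article_author_id column (a hypothetical name), so the per-author count no longer needs to touch the articles table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           author_id INTEGER REFERENCES authors(id),
                           title TEXT);
    -- Redundant column: duplicates the article's author on every comment.
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER REFERENCES articles(id),
                           article_author_id INTEGER REFERENCES authors(id),
                           body TEXT);
""")

# Counting a given author's comments no longer touches the articles table,
# but every INSERT into comments must first look up the article's author.
query = "SELECT COUNT(*) FROM comments WHERE article_author_id = ?"
print(conn.execute(query, (1,)).fetchone()[0])
```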

But this implies a redundant relationship.

In terms of performance, usability, and coding best practices, which approach is the better one?

I remember having read somewhere that one should only use the first approach and avoid redundant relationships, but I don't remember where or why. Is there a link to a scientific treatment of this question?

Answer

Your first approach is a normalized design. It should be the default: it is more maintainable, less error-prone, and requires less code overall.

The second option is a denormalized design. If you think it through, it requires you to find the author of the article every time someone posts a comment and increment a "comment_count" field; that is probably more code, and it makes writing a comment slower. It also means that a simple bug in your "create comment" code could break the application logic, and you probably need a transaction around each comment write so that the new comment and the update to "authors.comment_count" succeed or fail together.
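A minimal sketch of that write path, assuming a hypothetical comment_count column on authors and again using sqlite3; the `with conn:` block wraps both statements in one transaction so they commit or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT,
                           comment_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           author_id INTEGER REFERENCES authors(id));
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER REFERENCES articles(id),
                           body TEXT);
    INSERT INTO authors  (id, name) VALUES (1, 'Alice');
    INSERT INTO articles (id, author_id) VALUES (10, 1);
""")

def post_comment(conn, article_id, body):
    """Insert a comment and bump the denormalized counter atomically."""
    with conn:  # one transaction: both statements commit or roll back together
        conn.execute(
            "INSERT INTO comments (article_id, body) VALUES (?, ?)",
            (article_id, body),
        )
        conn.execute(
            """UPDATE authors SET comment_count = comment_count + 1
               WHERE id = (SELECT author_id FROM articles WHERE id = ?)""",
            (article_id,),
        )

post_comment(conn, 10, "Nice article!")
print(conn.execute("SELECT comment_count FROM authors WHERE id = 1").fetchone())
```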

So the second option is definitely more complex and slower for writing comments. It may be faster for querying, but since you would be joining on primary keys, you will almost certainly not be able to measure that performance impact until the database reaches hundreds of millions of records.

In general, I recommend the following approach; take each step only if the previous steps have not given you enough performance.


  • Design a relational model.
  • Tune that relational database (indexes, etc.).
  • Improve the hardware: RAM, CPU, SSD disks, etc.
  • Create a measurement rig so you can identify the performance challenges and run experiments. Create benchmarks based on current and expected data sizes; find a way to fill your test rig with dummy data until you have the data volume you need to scale to.
  • Run your queries on the test rig. Make sure there are no further performance tweaks available from indexing or query optimization.
  • Introduce application-level caching. In your example, caching the number of comments for an author for 1 hour may be acceptable (see the sketch after this list).
  • De-normalize your schema. Use your test rig to prove that it gives you the performance you expect.
  • Look at more exotic data solutions: sharding, data partitioning, etc.
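As an illustration of the caching step, here is one way a tiny time-based cache might look in application code; the 1-hour TTL comes from the answer above, while the function and variable names are made up for the example.

```python
import time

# Hypothetical cache: author_id -> (comment_count, timestamp of last refresh)
_cache = {}
CACHE_TTL_SECONDS = 3600  # 1 hour, as suggested above

def comment_count_for_author(author_id, fetch_from_db):
    """Return a possibly stale comment count, hitting the database
    at most once per TTL window per author."""
    now = time.time()
    cached = _cache.get(author_id)
    if cached is not None and now - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    count = fetch_from_db(author_id)   # the expensive query
    _cache[author_id] = (count, now)
    return count

# Example usage with a stand-in for the real database query:
print(comment_count_for_author(1, lambda author_id: 42))
```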

Denormalization comes so far down the list because it introduces real maintenance risks, makes your code much more complex, and in most cases is nowhere near as effective as adding an extra 4 GB of RAM to your server.
