Alternative to database design with enumerated columns, leading to poor performance


Question

Following remarks on my previous question, I describe here the problem that led me to a database schema with enumerated columns, which performs poorly.

(See the overall conclusion at the bottom of this question.)

I work with gene expression data. We capture the conditions in which each gene is expressed (for instance, gene X is expressed in the condition [organ Y - life stage Z]). I have 4 dataTypes that can produce such expression data, so my raw data are stored in different tables, for instance (this is just an illustrative example; the raw data are far more complex):

+--------------------+------------------------------------+------+-----+--------------+-------+
| Field              | Type                               | Null | Key | Default      | Extra |
+--------------------+------------------------------------+------+-----+--------------+-------+
| geneId             | int(10) unsigned                   | NO   | PRI | NULL         |       |
| evidenceId         | varchar(70)                        | NO   | PRI | NULL         |       |
| experimentId       | varchar(70)                        | NO   | MUL | NULL         |       |
| conditionId        | mediumint(8) unsigned              | NO   | MUL | NULL         |       |
| expressionId       | int(10) unsigned                   | NO   | MUL | NULL         |       |
| detectionFlag      | enum('expressed', 'not expressed') | NO   |     | NULL         |       |
| quality            | enum('low quality','high quality') | NO   |     | NULL         |       |
+--------------------+------------------------------------+------+-----+--------------+-------+

I have one such table per dataType. Typical queries request thousands of genes at the same time. Because the data are very large (several hundred million rows in each table) and include redundant values (tons of evidence for the same gene, tons of genes for the same evidence), querying each table individually is very slow. For this reason, we have a precomputed "summary" table, computed from the information in these 4 tables:

+----------------+-----------------------+------+-----+---------+----------------+
| Field          | Type                  | Null | Key | Default | Extra          |
+----------------+-----------------------+------+-----+---------+----------------+
| expressionId   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId         | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId    | mediumint(8) unsigned | NO   | MUL | NULL    |                |
+----------------+-----------------------+------+-----+---------+----------------+

(Note that there are other useful columns in this table.) The expressionId field makes it possible to get back to the raw data.

Now my problem is:


  • For each data type, we compute a summaryQuality based on the number of distinct experiments supporting an expression row, in the condition itself but also taking into account any related condition (I leave out what a related condition is, but, yes, relations between conditions could be stored in another table).
  • Users should be able to compute a "global" summaryQuality by summing the experiments supporting an expression row for any combination of dataTypes. E.g., they should be able to say "give me results supported by x experiments from the sum of experiments in dataType1 and dataType2", or "give me results supported by y experiments from the sum of experiments in dataType1 and dataType2 and dataType3 and dataType4".

So I ended up with the following design:

+--------------------------+-----------------------+------+-----+---------+----------------+
| Field                    | Type                  | Null | Key | Default | Extra          |
+--------------------------+-----------------------+------+-----+---------+----------------+
| expressionId             | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId                   | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId              | mediumint(8) unsigned | NO   | MUL | NULL    |                |
| dataType1ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType2ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType3ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType4ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
+--------------------------+-----------------------+------+-----+---------+----------------+

Rows in this table are precomputed by taking into account all dataTypes and all conditions related to a given conditionId. This is very slow to compute, and the resulting table has hundreds of millions of rows.
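As a rough illustration of what such a precomputed count is built from, the query below counts the distinct experiments behind each expression row for one data type. It is only a sketch: the table name `dataType1Evidence` is hypothetical (the question does not name the raw tables), and it deliberately ignores related conditions and any filtering on detectionFlag or quality, which the real precomputation presumably handles.

```sql
-- Hypothetical raw table name; the columns are those of the illustrative raw table above.
-- Counts the distinct experiments supporting each expression row for one data type.
-- Related conditions are NOT taken into account in this simplified sketch.
SELECT expressionId,
       COUNT(DISTINCT experimentId) AS dataType1ExperimentCount
FROM dataType1Evidence
GROUP BY expressionId;
```

Running a query of this kind over several hundred million raw rows for every data type is consistent with the precomputation being slow.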

Now my queries look like:

SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount + dataType3ExperimentCount + dataType4ExperimentCount) >= ?;
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount) >= ?;

Performance is very bad because, according to the answers to my previous question, such queries can't use an index for the condition on the summed columns. I need to allow any combination of dataTypes, and I need to allow new dataTypes to be added in the future (so the number of combinations would quickly reach 32 or 64).

Can anyone think of a better design?

Edit, following a request from user Rick James, here is the SHOW CREATE TABLE output:

CREATE TABLE `expression` (
  `expressionId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `geneId` mediumint(8) unsigned NOT NULL,
  `conditionId` mediumint(8) unsigned NOT NULL,
  `dataType1ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType2ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType3ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType4ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`expressionId`),
  UNIQUE KEY `geneId` (`geneId`,`conditionId`),
  KEY `conditionId` (`conditionId`),
  CONSTRAINT `expression_ibfk_1` FOREIGN KEY (`geneId`) REFERENCES `gene` (`geneId`) ON DELETE CASCADE,
  CONSTRAINT `expression_ibfk_2` FOREIGN KEY (`conditionId`) REFERENCES `cond` (`conditionId`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

(And, yes, a given geneId has fewer rows in the table than a given conditionId, so the composite unique key is properly ordered.)

Edit, overall conclusion:


  • @RickJames's answer made my queries run 4 to 5 times faster; they now run in a reasonable time. Problem solved for now.
  • But @Strawberry is right that my design could be improved (see the comments on this question).
  • But on MySQL, the "correct" design makes my queries run 10 times slower. I think this is because MySQL is a row-based database, well suited to retrieving information stored in multiple columns of a single row, as in my current "enumerated" design.
  • I think that the long-term solution would be to switch to a column-based database, as proposed in @ŁukaszKamiński's answer, using the correct design proposed by @Strawberry, because then the information to retrieve would be in several rows but in the same column.
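For readers who have not seen the comments, the "correct" design referred to above is presumably a normalized table with one row per gene, condition and data type. The sketch below is an assumption made for illustration, not a copy of what @Strawberry proposed: the table name `expressionCount`, the column `dataTypeId`, and the literal values in the query are all hypothetical.

```sql
-- Hypothetical normalized table: one row per (gene, condition, data type).
CREATE TABLE expressionCount (
  geneId          mediumint unsigned NOT NULL,
  conditionId     mediumint unsigned NOT NULL,
  dataTypeId      tinyint unsigned   NOT NULL,
  experimentCount smallint unsigned  NOT NULL DEFAULT 0,
  PRIMARY KEY (geneId, conditionId, dataTypeId)
) ENGINE=InnoDB;

-- "Results supported by at least 5 experiments from dataType1 and dataType2 combined".
-- Adding a data type means adding rows, not columns, and any combination
-- is expressed through the IN list on dataTypeId.
SELECT geneId,
       conditionId,
       SUM(experimentCount) AS totalExperimentCount
FROM expressionCount
WHERE geneId IN (101, 102, 103)
  AND dataTypeId IN (1, 2)
GROUP BY geneId, conditionId
HAVING SUM(experimentCount) >= 5;
```

With this layout, each result requires reading and aggregating several rows instead of adding columns within one row, which is consistent with the slowdown reported on MySQL.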

Recommended answer

Instead of

PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),

use

PRIMARY KEY(`geneId`,`conditionId`),
INDEX (`expressionId`),

If no other tables are referencing expressionId, get rid of that column and the index on it.

Why does this help? InnoDB clusters the data on the primary key. You are looking up data by geneId, which is the first column of the PK, so the rows for a given gene are stored together and can be fetched more efficiently, especially if the table is a lot bigger than innodb_buffer_pool_size (which should be about 70% of RAM).
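One way to apply this, sketched here under assumptions, is to build a new table with the suggested keys and copy the rows across in primary-key order. The name `expression_clustered` is hypothetical, and the foreign keys from the original definition are left out for brevity; they would need to be re-created on the new table.

```sql
-- Hypothetical new table: same columns as `expression`, clustered on (geneId, conditionId).
CREATE TABLE expression_clustered (
  geneId                   mediumint unsigned NOT NULL,
  conditionId              mediumint unsigned NOT NULL,
  expressionId             int unsigned       NOT NULL AUTO_INCREMENT,
  dataType1ExperimentCount smallint unsigned  NOT NULL DEFAULT 0,
  dataType2ExperimentCount smallint unsigned  NOT NULL DEFAULT 0,
  dataType3ExperimentCount smallint unsigned  NOT NULL DEFAULT 0,
  dataType4ExperimentCount smallint unsigned  NOT NULL DEFAULT 0,
  PRIMARY KEY (geneId, conditionId),
  KEY expressionId (expressionId)  -- an AUTO_INCREMENT column must be the first column of some index
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Copy the existing rows, keeping their expressionId values.
INSERT INTO expression_clustered
  (geneId, conditionId, expressionId,
   dataType1ExperimentCount, dataType2ExperimentCount,
   dataType3ExperimentCount, dataType4ExperimentCount)
SELECT geneId, conditionId, expressionId,
       dataType1ExperimentCount, dataType2ExperimentCount,
       dataType3ExperimentCount, dataType4ExperimentCount
FROM expression
ORDER BY geneId, conditionId;
```

The queries from the question stay the same apart from the table name. The sum over the count columns is still evaluated row by row, but the rows for each requested geneId now sit next to each other in the clustered index, so fewer pages have to be read.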
