Alternative to a database design with enumerated columns, leading to poor performance
Question
Following remarks on my previous question, I describe here the problem that led me to a database schema with enumerated columns, which performs poorly.
(See the overall conclusion at the bottom of this question.)
I work with gene expression data. We capture in which conditions any gene is expressed (for instance, to say that gene X is expressed in the condition [organ Y - life stage Z]). I have 4 dataTypes that can produce such expression data. So my raw data are stored in different tables, for instance (this is just an illustrative example; the raw data are far more complex):
+--------------------+------------------------------------+------+-----+--------------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+------------------------------------+------+-----+--------------+-------+
| geneId | int(10) unsigned | NO | PRI | NULL | |
| evidenceId | varchar(70) | NO | PRI | NULL | |
| experimentId | varchar(70) | NO | MUL | NULL | |
| conditionId | mediumint(8) unsigned | NO | MUL | NULL | |
| expressionId | int(10) unsigned | NO | MUL | NULL | |
| detectionFlag | enum('expressed', 'not expressed') | NO | | NULL | |
| quality | enum('low quality','high quality') | NO | | NULL | |
+--------------------+------------------------------------+------+-----+--------------+-------+
I have one such table per dataType. Now, typical queries will request thousands of genes at the same time. Because the data are very large (several hundred million rows in each table) and include redundant values (tons of evidence for the same gene, tons of genes for the same evidence), it is very slow to query each table individually. For this reason, we have a precomputed "summary" table, computed from the information in these 4 tables:
+----------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-----------------------+------+-----+---------+----------------+
| expressionId | int(10) unsigned | NO | PRI | NULL | auto_increment |
| geneId | int(10) unsigned | NO | MUL | NULL | |
| conditionId | mediumint(8) unsigned | NO | MUL | NULL | |
+----------------+-----------------------+------+-----+---------+----------------+
(Note that there are other useful columns in this table.) The expressionId field makes it possible to get back to the raw data.
Now my problem is:
- For each data type, we compute a summaryQuality, based on the number of distinct experiments supporting an expression row, in the condition itself, but also taking into account any related condition (I leave out what a related condition is, but, yes, relations between conditions could be stored in another table).
- Users should be able to compute a "global" summaryQuality, by summing the experiments supporting an expression row for any combination of dataTypes. E.g., they should be able to say "give me results supported by x experiments from the sum of experiments in dataType1 and dataType2", or "give me results supported by y experiments from the sum of experiments in dataType1 and dataType2 and dataType3 and dataType4".
So I ended up with the following design:
+--------------------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+-----------------------+------+-----+---------+----------------+
| expressionId | int(10) unsigned | NO | PRI | NULL | auto_increment |
| geneId | int(10) unsigned | NO | MUL | NULL | |
| conditionId | mediumint(8) unsigned | NO | MUL | NULL | |
| dataType1ExperimentCount | smallint(5) unsigned | NO | | 0 | |
| dataType2ExperimentCount | smallint(5) unsigned | NO | | 0 | |
| dataType3ExperimentCount | smallint(5) unsigned | NO | | 0 | |
| dataType4ExperimentCount | smallint(5) unsigned | NO | | 0 | |
+--------------------------+-----------------------+------+-----+---------+----------------+
Rows in this table are precomputed by taking into account all dataTypes and all related conditions of a given conditionId. This is very slow to compute. The table has hundreds of millions of rows as a result.
Now my queries look like:
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount + dataType3ExperimentCount + dataType4ExperimentCount) >= ?;
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount) >= ?;
Performance is very bad, because such queries can't use indexes, according to the answers to my previous question. I need to allow any combination of dataTypes. I also need to allow the addition of new dataTypes in the future (so the number of combinations would quickly reach 32 or 64).
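One way to see how MySQL handles such a query is to prefix it with EXPLAIN. The sketch below assumes the table is named expression (as in the SHOW CREATE TABLE output further down) and uses placeholder gene ids and a placeholder threshold. The arithmetic over several columns cannot be matched against an index, so at best the geneId list narrows the rows and the sum is then evaluated as a filter on each remaining row.

```sql
-- Placeholder values: gene ids 101..103 and a threshold of 5 are assumptions.
EXPLAIN
SELECT *
FROM expression
WHERE geneId IN (101, 102, 103)
  AND (dataType1ExperimentCount + dataType2ExperimentCount) >= 5;
```

The key and rows columns of the output indicate which index, if any, is chosen and roughly how many rows MySQL expects to examine.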
Can you think of a better design?
Edit following a request from user Rick James, here is the SHOW CREATE TABLE output:
CREATE TABLE `expression` (
`expressionId` int(10) unsigned NOT NULL AUTO_INCREMENT,
`geneId` mediumint(8) unsigned NOT NULL,
`conditionId` mediumint(8) unsigned NOT NULL,
`dataType1ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
`dataType2ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
`dataType3ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
`dataType4ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),
KEY `conditionId` (`conditionId`),
CONSTRAINT `expression_ibfk_1` FOREIGN KEY (`geneId`) REFERENCES `gene` (`geneId`) ON DELETE CASCADE,
CONSTRAINT `expression_ibfk_2` FOREIGN KEY (`conditionId`) REFERENCES `cond` (`conditionId`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
(And, yes, a given geneId has fewer rows in the table than a given conditionId, so the multi-column unique key is properly ordered.)
Edit, overall conclusion:
- @RickJames's answer made my queries run 4 to 5 times faster; they now run in a reasonable time. Problem solved for now.
- But @Strawberry is right that my design could be improved (see the comments on this question).
- But on MySQL, the "correct" design makes my queries run 10 times slower. I think this is because MySQL is a row-based database, well suited to retrieving all the information stored in multiple columns of a single row, as in my current "enumerated" design.
- I think that the long-term solution would be to switch to a column-based database, as proposed in @ŁukaszKamiński's answer, using the correct design proposed by @Strawberry. Because then the information to retrieve would be in several rows, but in the same column.
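For readers who have not seen the comments, here is a rough sketch of what a normalized ("correct") design could look like. The comments themselves are not reproduced here, so the table name expressionExperimentCount, the dataTypeId column, and the literal values are all assumptions made for illustration. The idea is one row per (gene, condition, dataType), so that a new dataType becomes a new row value rather than a new column.

```sql
-- Hypothetical normalized layout; names are assumptions, not the actual schema.
CREATE TABLE expressionExperimentCount (
  geneId          MEDIUMINT UNSIGNED NOT NULL,
  conditionId     MEDIUMINT UNSIGNED NOT NULL,
  dataTypeId      TINYINT UNSIGNED   NOT NULL,  -- 1..4 today, more later
  experimentCount SMALLINT UNSIGNED  NOT NULL DEFAULT 0,
  PRIMARY KEY (geneId, conditionId, dataTypeId)
) ENGINE=InnoDB;

-- "Results supported by at least 5 experiments from dataType1 and dataType2"
SELECT geneId, conditionId, SUM(experimentCount) AS totalExperiments
FROM expressionExperimentCount
WHERE geneId IN (101, 102, 103)    -- placeholder gene ids
  AND dataTypeId IN (1, 2)         -- any combination of dataTypes
GROUP BY geneId, conditionId
HAVING SUM(experimentCount) >= 5;  -- placeholder threshold
```

Any combination of dataTypes is expressed by changing the dataTypeId list, but each result now requires aggregating several rows, which is consistent with the slowdown reported above on MySQL.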
Answer
Instead of
PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),
use
PRIMARY KEY(`geneId`,`conditionId`),
INDEX (`expressionId`),
If no other tables are referencing expressionId, get rid of that column and the index on it.
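As a sketch, the key change above could be applied to the existing table with a single ALTER TABLE, assuming the table and index names shown in the question's SHOW CREATE TABLE output. This has not been tested against the asker's server; changing the primary key rebuilds the whole table, so on hundreds of millions of rows it can take a long time and is best tried on a copy first.

```sql
-- Assumes the `expression` table and its `geneId` unique key from the question.
-- All four changes are made in one statement so that the AUTO_INCREMENT column
-- stays indexed and geneId stays the leading column of an index for its foreign key.
ALTER TABLE expression
  DROP PRIMARY KEY,
  DROP INDEX geneId,
  ADD PRIMARY KEY (geneId, conditionId),
  ADD INDEX (expressionId);
```

The separate index on expressionId is kept because MySQL requires an AUTO_INCREMENT column to be part of a key.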
Why does this help? The data is clustered with the primary key; you are looking up data by geneId, which is the start of the PK; hence the data can be fetched more efficiently, especially if the table is a lot bigger than innodb_buffer_pool_size (which should be about 70% of RAM).
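To compare the table size with the buffer pool, something along these lines can be run. It assumes the table is named expression and lives in the currently selected schema, and the sizes reported by information_schema are estimates for InnoDB tables.

```sql
-- Current buffer pool size, in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate size of the table (data + indexes), in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'expression';
```

If the table is much larger than the buffer pool, lookups that touch scattered pages are more likely to hit disk, which is where clustering by geneId pays off.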