删除匹配行的更快方法? [英] Faster way to delete matching rows?

查看:90
本文介绍了删除匹配行的更快方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

关于数据库,我是一个相对新手.我们正在使用MySQL,而我目前正在尝试加快似乎需要一段时间才能运行的SQL语句的速度.我四处寻找类似的问题,但没有找到一个问题.

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.

目标是删除表A中表B中具有匹配ID的所有行.

The goal is to remove all the rows in table A that have a matching id in table B.

我目前正在执行以下操作:

I'm currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);

表a中约有10万行,表b中约有22K行. "id"列是两个表的PK.

There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.

在我的测试箱上运行此语句大约需要3分钟-Pentium D,XP SP3、2GB内存,MySQL 5.0.67.在我看来,这很慢.也许不是,但是我希望加快速度.有没有更好/更快的方法来完成此任务?

This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?

一些其他信息可能会有所帮助.表A和B具有与创建表B相同的结构:

Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:

CREATE TABLE b LIKE a;

表a(以及表b)具有一些索引,以帮助加快对其进行的查询.同样,我还是DB工作的相对新手,现在仍在学习.我不知道这会对事物产生多大的影响(如果有的话).我认为它确实有效果,因为索引也必须清理,对吗?我还想知道是否还有其他数据库设置可能会影响速度.

Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.

另外,我正在使用INNO DB.

Also, I'm using INNO DB.

以下是一些可能对您有帮助的其他信息.

Here is some additional info that might be helpful to you.

表A的结构与此类似(我对此做了一些消毒):

Table A has a structure similar to this (I've sanitized this a bit):

DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE  `frobozz`.`a` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `fk_g` varchar(30) NOT NULL,
  `h` int(10) unsigned default NULL,
  `i` longtext,
  `j` bigint(20) NOT NULL,
  `k` bigint(20) default NULL,
  `l` varchar(45) NOT NULL,
  `m` int(10) unsigned default NULL,
  `n` varchar(20) default NULL,
  `o` bigint(20) NOT NULL,
  `p` tinyint(1) NOT NULL,
  PRIMARY KEY  USING BTREE (`id`),
  KEY `idx_l` (`l`),
  KEY `idx_h` USING BTREE (`h`),
  KEY `idx_m` USING BTREE (`m`),
  KEY `idx_fk_g` USING BTREE (`fk_g`),
  KEY `fk_g_frobozz` (`id`,`fk_g`),
  CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

我怀疑问题的一部分在于此表有许多索引. 表B与表B相似,尽管它只包含列idh.

I suspect that part of the issue is there are a number of indexes for this table. Table B looks similar to table B, though it only contains the columns id and h.

此外,分析结果如下:

starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002


已解决

感谢所有回复和评论.他们当然让我考虑了这个问题.恭喜 dotjoe ,让我通过问一个简单的问题还有其他表是否引用a.id吗?"

Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

问题是表A上有一个DELETE TRIGGER,它调用了一个存储过程来更新另外两个表C和D.表C的FK返回a.id,并且在执行了与该id相关的一些操作后存储过程,它有一条语句,

The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,

DELETE FROM c WHERE c.id = theId;

我查看了EXPLAIN语句,并将其改写为

I looked into the EXPLAIN statement and rewrote this as,

EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;

所以,我可以看到它在做什么,并且它提供了以下信息:

So, I could see what this was doing and it gave me the following info:

id            1
select_type   SIMPLE
table         c
type          ALL
possible_keys NULL
key           NULL
key_len       NULL
ref           NULL
rows          2633
Extra         using where

这告诉我这是一个痛苦的操作,并且由于要被调用22500次(对于给定的数据集被删除),因此出现了问题.一旦我在other_id列上创建了INDEX并重新运行EXPLAIN,我得到:

This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:

id            1
select_type   SIMPLE
table         c
type          ref
possible_keys Index_1
key           Index_1
key_len       8
ref           const
rows          1
Extra         

好得多,实际上很棒.

我补充说,Index_1和删除时间与 mattkemp 报告的时间一致.就我而言,这是一个非常微妙的错误,因为在最后一刻出现了一些附加功能.事实证明,大多数建议的替代DELETE/SELECT语句(如 Daniel 所述)最终花费的时间基本上相同,并且如 soulmerge 所述,该语句很漂亮.我将能够根据自己的需要构造出最好的东西.一旦为另一个表C提供了索引,我的DELETE就很快了.

I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.

尸检:
从这个练习中学到了两个教训.首先,很明显,我没有利用EXPLAIN语句的功能来更好地了解SQL查询的影响.那是一个菜鸟的错误,所以我不会为那个问题而自责.我将从错误中学习.其次,令人反感的代码是快速完成"思路的结果,而设计/测试不足则导致该问题不会很快出现.如果我生成了多个可观的测试数据集以用作此新功能的测试输入,那么我不会浪费我的时间,也不会浪费你的时间.我在数据库方面的测试缺乏应用程序方面的深度.现在,我有机会改善这一点.

Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.

参考:EXPLAIN语句

推荐答案

从InnoDB删除数据是您可以要求的最昂贵的操作.您已经发现查询本身不是问题-无论如何,大多数查询都会针对相同的执行计划进行优化.

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.

虽然可能很难理解为什么所有情况下的DELETE速度最慢,但有一个相当简单的解释. InnoDB是一个事务存储引擎.这意味着,如果您的查询在中途中止,则所有记录仍将保留在原地,就像什么也没发生一样.完成后,所有内容将在同一瞬间消失.在DELETE期间,连接到服务器的其他客户端将看到记录,直到完成DELETE.

While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.

为实现这一目标,InnoDB使用了一种称为MVCC(多版本并发控制)的技术.它的基本作用是为每个连接提供整个数据库的快照视图,就像事务的第一条语句开始时一样.为此,InnoDB内部的每个记录可以有多个值-每个快照一个.这也是为什么在InnoDB上进行计数需要一些时间的原因-这取决于您当时看到的快照状态.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.

对于DELETE事务,根据查询条件标识的每条记录都将标记为删除.由于其他客户端可能同时访问数据,因此它无法立即将它们从表中删除,因为它们必须查看各自的快照以保证删除的原子性.

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

一旦所有记录都被标记为删除,事务将被成功提交.即使这样,也不能在将所有与快照值一起使用的其他事务(在DELETE事务之前)都结束之前,立即将它们从实际数据页中删除.

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

因此,考虑到必须修改所有记录以便为以安全交易方式将其删除做准备,您的3分钟实际上并没有那么慢.语句运行时,您可能会听到"硬盘工作.这是由于访问所有行引起的. 为了提高性能,您可以尝试增加服务器的InnoDB缓冲池大小,并尝试在删除时限制对数据库的其他访问,从而也减少了InnoDB每条记录必须维护的历史版本数. 有了额外的内存,InnoDB也许可以将您的表(大部分)读到内存中,从而避免一些磁盘搜索时间.

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows. To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record. With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

这篇关于删除匹配行的更快方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆