使用相关子查询删除 4200 万行表的 SQL? [英] DELETE SQL with correlated subquery for table with 42 million rows?

查看:33
本文介绍了使用相关子查询删除 4200 万行表的 SQL?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个表 cats 有 42,795,120 行.

显然这是很多行.所以当我这样做时:

/* owner_cats 是一个多对多连接表 */从猫中删除WHERE cat.id_cat IN (SELECT owner_cats.id_cat FROM owner_catsWHERE owner_cats.id_owner = 1)

查询超时:(

(我需要增加我的 CommandTimeout 值,默认只有 30 秒)

我不能使用 TRUNCATE TABLE 猫,因为我不想从其他主人那里吹走猫.

我正在使用 SQL Server 2005,并将恢复模型"设置为简单".

所以,我想过做这样的事情(顺便说一句,从应用程序执行这个 SQL):

从猫中删除 TOP (25) PERCENTWHERE cat.id_cat IN (SELECT owner_cats.id_cat FROM owner_catsWHERE owner_cats.id_owner = 1)从猫中删除 TOP(50) %WHERE cat.id_cat IN (SELECT owner_cats.id_cat FROM owner_catsWHERE owner_cats.id_owner = 1)从猫中删除WHERE cat.id_cat IN (SELECT owner_cats.id_cat FROM owner_catsWHERE owner_cats.id_owner = 1)

我的问题是:SQL Server 2005中可以DELETE的行数阈值是多少?

或者,如果我的方法不是最佳的,请提出更好的方法.谢谢.

这篇文章对我的帮助还不够:

编辑 (8/6/2010):

好的,我再次阅读上述链接后才意识到这些表上没有索引.此外,你们中的一些人已经在下面的评论中指出了这个问题.请记住,这是一个虚构的架构,因此即使 id_cat 也不是 PK,因为在我的现实生活架构中,它不是唯一的字段.

我会添加索引:

  1. cats.id_cat
  2. owner_cats.id_cat
  3. owner_cats.id_owner

我想我仍然掌握了这个数据仓库的窍门,显然我需要所有 JOIN 字段的索引,对吗?

但是,我需要花费数小时才能完成此批量加载过程.我已经将它作为 SqlBulkCopy(以块为单位,而不是一次 4200 万个).我有一些索引和 PK.我阅读了以下帖子,这些帖子证实了我的理论,即即使是批量复制,索引也会减慢:

所以我将在复制之前DROP我的索引,然后在完成后重新CREATE它们.

由于加载时间很长,我需要一段时间来测试这些建议.我会报告结果.

更新 (8/7/2010):

汤姆建议:

删除从猫 c存在的地方(选择 1FROM owner_cats o哪里 o.id_cat = c.id_catAND o.id_owner = 1)

并且仍然没有索引,对于 4200 万行,使用上述方式需要 13:21 分:秒与 22:08.然而,对于 1300 万行,他以 2:13 和我以前的 2:10 的方式进行.这是个好主意,但我仍然需要使用索引!

更新(8/8/2010):

有些不对劲!现在打开索引后,我上面的第一个删除查询花费了 1:9 hrs:min (是的一个小时!) 而不是 22:08 min:sec 和 13:21 min:sec 与 2:10 min:sec 分别为 4200 万行和 1300 万行.我现在要尝试使用索引进行 Tom 的查询,但这是朝着错误的方向前进.请帮忙.

更新 (8/9/2010):

Tom 删除 4200 万行需要 1:06 小时:分钟,1300 万行索引需要 10:50 分:秒,而分别是 13:21 分:秒和 2:13 分:秒.当我使用索引数量级时,删除在我的数据库上花费的时间更长!我想我知道为什么,我的数据库 .mdf 和 .ldf 从 3.5 GB 增长到 40.6GB 在第一次(4200 万)删除!我做错了什么?

更新(8/10/2010):

由于没有任何其他选择,我想出了一个我觉得很乏味的解决方案(希望是暂时的):

  1. 将数据库连接的超时时间增加到 1 小时(CommandTimeout=60000; 默认为 30 秒)
  2. 使用 Tom 的查询:DELETE FROM WHERE EXISTS (SELECT 1 ...) 因为它执行得更快一些
  3. DROP 运行删除语句前的所有索引和 PK (???)
  4. 运行DELETE语句
  5. CREATE 所有索引和 PKs

看起来很疯狂,但至少它比使用 TRUNCATE 并从第一个 owner_id 开始重新加载要快,因为我的一个 owner_id 需要 2:30 hrs:min 加载,而我刚刚描述的删除过程需要 17:22 min:sec 4200 万行.(注意:如果我的加载过程抛出异常,我会重新开始那个 owner_id,但我不想吹走以前的 owner_id,所以我不想到 TRUNCATE owner_cats 表,这就是我尝试使用 DELETE 的原因.)

仍将不胜感激:)

解决方案

您是否尝试过不使用子查询而使用连接?

删除猫发件人猫内部连接 ​​owner_cats oc在 c.id_cat = oc.id_cat哪里id_owner =1

如果你有,你也尝试过不同的加入提示,例如

删除猫发件人猫INNER HASH JOIN owner_cats oc在 c.id_cat = oc.id_cat哪里id_owner =1

I have a table cats with 42,795,120 rows.

Apparently this is a lot of rows. So when I do:

/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

the query times out :(

(edit: I need to increase my CommandTimeout value, default is only 30 seconds)

I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.

I'm using SQL Server 2005 with "Recovery model" set to "Simple."

So, I thought about doing something like this (executing this SQL from an application btw):

DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?

Or, if my approach is not optimal, please suggest a better approach. Thanks.

This post didn't help me enough:

EDIT (8/6/2010):

Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.

I will put indexes on:

  1. cats.id_cat
  2. owner_cats.id_cat
  3. owner_cats.id_owner

I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields right?

However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts which confirms my theory that the indexes are slowing down even a bulk copy:

So I'm going to DROP my indexes before the copy and then re CREATE them when it's done.

Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.

UPDATE (8/7/2010):

Tom suggested:

DELETE
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)

And still with no indexes, for 42 million rows, it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows, took him 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!

Update (8/8/2010):

Something is terribly wrong! Now with the indexes on, my first delete query above took 1:9 hrs:min (yes an hour!) versus 22:08 min:sec and 13:21 min:sec versus 2:10 min:sec for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.

Update (8/9/2010):

Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes versus 13:21 min:sec and 2:13 min:sec respectively. Deletes are taking longer on my database when I use indexes by an order of magnitude! I think I know why, my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?

Update (8/10/2010):

For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):

  1. Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
  2. Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
  3. DROP all indexes and PKs before running delete statement (???)
  4. Run DELETE statement
  5. CREATE all indexes and PKs

Seems crazy, but at least it's faster than using TRUNCATE and starting over my load from the beginning with the first owner_id, because one of my owner_id takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away previous owner_id, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)

Anymore help would still be appreciated :)

解决方案

Have you tried no Subquery and use a join instead?

DELETE cats 
FROM
 cats c
 INNER JOIN owner_cats oc
 on c.id_cat = oc.id_cat
WHERE
   id_owner =1

And if you have have you also tried different Join hints e.g.

DELETE cats 
FROM
 cats c
 INNER HASH JOIN owner_cats oc
 on c.id_cat = oc.id_cat
WHERE
   id_owner =1

这篇关于使用相关子查询删除 4200 万行表的 SQL?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆