How to select and/or delete all but one row of each set of duplicates in a table?


Problem description

Let's say I have a MySQL table with four columns:

ID, DRIVER_ID, CAR_ID, NOTES (NULL for most rows)

I have a bunch of duplicate rows where DRIVER_ID and CAR_ID are the same. For each pair of DRIVER_ID and CAR_ID, I want one row. If one of the rows in the set has non-NULL NOTES, I want that one, but otherwise it doesn't matter.

So if I have:

ID  |  DRIVER_ID  |  CAR_ID  |  NOTES
1      1             1          NULL
2      1             1          NULL
3      1             2          NULL
4      1             2          NULL
5      2             3          NULL
6      2             3          NULL
7      2             3          NULL
8      2             3          hi
9      3             5          NULL

I want to keep the following IDs: 9, 8, and then one each of [3,4] and [1,2].

It's a huge table, and the clunky methods I've tried are insanely slow, to the point where I'm sure I'm going about it all wrong. How can I efficiently a) select the list of IDs to delete? b) delete them in the same query?

(And yes, I know the deal with composite keys. That's not an issue here.)

EDIT: Sorry, forgot to specify that this was MySQL.

Some of the stuff I've tried so far:

select ID, COUNT(DRIVER_ID) rowcount from CARS_DRIVERS group by CAR_ID,DRIVER_ID HAVING rowcount > 1;

This gets me one ID per group, but it doesn't necessarily leave the row with NOTES if there is one. And because it returns only one ID per duplicate group, and some groups have 20+ duplicates, I would need to run it over and over to whittle each group down to a single row.

select distinct t1.ID from CARS_DRIVERS t1 where exists (select * from CARS_DRIVERS t2 where t2.CAR_ID = t1.CAR_ID and t2.DRIVER_ID = t1.DRIVER_ID and t2.id > t1.id);

This is much slower, and still doesn't really address the NOTES issue. It does have the advantage of getting the oldest row for each group, which, if I can't isolate on the NOTES field easily, could be a proxy for that. If a row in a set has NOTES, I believe it's always the oldest one (one with the lowest ID), but I'm not certain.

Some additional context: DRIVER_ID and CAR_ID are not the real column names, and there are other columns in the table. I was trying to distill down the info to get at the root of the problem, but I see from W4M's comment that this makes it look like a homework assignment. The real deal is that I'm looking at a very unoptimized database (not my purview normally) and when trying to get rid of these dupes before adding a key, the operation is taking forever. As in, hours. The table is big but certainly doesn't justify that. I'm trying to pitch in with my limited SQL expertise and figure out a way to get this done. Doesn't matter if it's pretty, I can sit at the command line and brute-force a bunch of queries if necessary. But I noticed that SELECTing IDs that are candidates for deletion only takes a few seconds, and although the table is huge, the total number of rows to delete is less than 10k so there must be a way to make this happen without some script that takes a whole weekend to finish.

Solution

Here's one solution. I tested this on MySQL 5.5.8.

SELECT MAX(COALESCE(c2.id, c1.id)) AS id,
 c1.driver_id, c1.car_id,
 c2.notes AS notes
FROM cars_drivers AS c1
LEFT OUTER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c2.notes IS NOT NULL
GROUP BY c1.driver_id, c1.car_id, c2.notes;

I include c2.notes as a GROUP BY key because you might have more than one row with non-null notes for a given (driver_id, car_id) pair. In that case the query returns one row per distinct notes value in the group.

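Result using your example data (my test table stores the columns in the order id, car_id, driver_id, so the driver_id and car_id values appear swapped relative to your listing):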

+------+-----------+--------+-------+
| id   | driver_id | car_id | notes |
+------+-----------+--------+-------+
|    2 |         1 |      1 | NULL  |
|    4 |         2 |      1 | NULL  |
|    8 |         3 |      2 | hi    |
|    9 |         5 |      3 | NULL  |
+------+-----------+--------+-------+
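
To get from the keeper query above to the list of IDs to delete (part a of the question), one option is to select every id that the keeper query does not return. What follows is a minimal sketch, not the code tested above: it assumes Python's built-in sqlite3 module with an in-memory table as a stand-in for MySQL, loads the sample rows in the question's column layout, and spells out the pair comparison column by column. The names KEEP_SQL, keep_ids and delete_ids are made up for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars_drivers ("
    "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
)
rows = [
    (1, 1, 1, None), (2, 1, 1, None),
    (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None),
]
conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)

# Same shape as the keeper query: prefer a row with notes, else the highest id.
KEEP_SQL = """
    SELECT MAX(COALESCE(c2.id, c1.id)) AS id
    FROM cars_drivers AS c1
    LEFT OUTER JOIN cars_drivers AS c2
      ON c1.driver_id = c2.driver_id
     AND c1.car_id = c2.car_id
     AND c2.notes IS NOT NULL
    GROUP BY c1.driver_id, c1.car_id, c2.notes
"""
keep_ids = sorted(r[0] for r in conn.execute(KEEP_SQL))

# Deletion candidates are all ids that the keeper query did not return.
delete_ids = sorted(
    r[0]
    for r in conn.execute(
        f"SELECT id FROM cars_drivers WHERE id NOT IN ({KEEP_SQL})"
    )
)
print(keep_ids, delete_ids)
```

With the question's sample rows this keeps ids 2, 4, 8 and 9 and marks 1, 3, 5, 6 and 7 for deletion. On a large MySQL table a NOT IN over a grouped self-join may be slow, so treat this as a way to inspect the candidates rather than as a tuned query.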

Regarding deletion: in your example data, the row you want to keep is always the one with the highest id value per driver_id and car_id pair. If you can depend on that, you can do a multi-table delete that removes every row for which another row with a higher id value and the same driver_id and car_id exists:

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

This naturally skips any cases where only one row exists with a given pair of driver_id & car_id values, because the conditions of the inner join require two rows with different id values.
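
The keep-the-highest-id rule can be tried out on a throwaway copy of the data before running it against the real table. The sketch below is an assumption-laden stand-in, not the MySQL statement above: it uses Python's sqlite3 module, and because SQLite has no multi-table DELETE syntax it expresses the same condition with a correlated EXISTS subquery. As far as I know MySQL rejects a subquery on the table being deleted from, so this form is for experimenting in SQLite only.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars_drivers ("
    "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
)
rows = [
    (1, 1, 1, None), (2, 1, 1, None),
    (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None),
]
conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)

# Delete every row that has a sibling with the same pair and a higher id.
# Rows that are alone in their group never satisfy EXISTS, so they survive.
conn.execute("""
    DELETE FROM cars_drivers
    WHERE EXISTS (
        SELECT 1
        FROM cars_drivers AS c2
        WHERE c2.driver_id = cars_drivers.driver_id
          AND c2.car_id = cars_drivers.car_id
          AND c2.id > cars_drivers.id
    )
""")

remaining = [r[0] for r in conn.execute("SELECT id FROM cars_drivers ORDER BY id")]
print(remaining)
```

On the sample rows the surviving ids are 2, 4, 8 and 9, the same set the keeper query reports.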

But if you can't depend on the highest id per group being the one you want to keep, the solution is more complex. It's probably more complex than it's worth to solve in one statement, so do it in two statements.

I tested this too, after adding a couple more rows for testing:

INSERT INTO cars_drivers VALUES (10,2,3,NULL), (11,2,3,'bye');

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  5 |      2 |         3 | NULL  |
|  6 |      2 |         3 | NULL  |
|  7 |      2 |         3 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 10 |      2 |         3 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

First, delete rows with null notes wherever a row with non-null notes exists in the same group.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id)
WHERE c1.notes IS NULL AND c2.notes IS NOT NULL;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

Second, delete all but the highest-id row from each group of duplicates.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  2 |      1 |         1 | NULL  |
|  4 |      1 |         2 | NULL  |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+
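
The two steps can be replayed end to end on the extended sample data. This is a minimal sketch under the same assumptions as before: Python's sqlite3 module stands in for MySQL, the rows use the question's column layout, and each multi-table DELETE is rewritten as a correlated EXISTS because SQLite lacks that syntax. The helper remaining_ids, the SAME_PAIR fragment and the index name ux_driver_car are hypothetical names chosen for this example.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars_drivers ("
    "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
)
rows = [
    (1, 1, 1, None), (2, 1, 1, None),
    (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None),
    (10, 2, 3, None), (11, 2, 3, "bye"),  # the two extra test rows
]
conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)


def remaining_ids():
    """Return the ids still in the table, in ascending order."""
    return [r[0] for r in conn.execute("SELECT id FROM cars_drivers ORDER BY id")]


# Shared condition: c2 belongs to the same (driver_id, car_id) pair.
SAME_PAIR = (
    "c2.driver_id = cars_drivers.driver_id AND c2.car_id = cars_drivers.car_id"
)

# Step 1: drop NULL-notes rows whose pair also has a row with notes.
conn.execute(f"""
    DELETE FROM cars_drivers
    WHERE notes IS NULL
      AND EXISTS (SELECT 1 FROM cars_drivers AS c2
                  WHERE {SAME_PAIR} AND c2.notes IS NOT NULL)
""")
after_step1 = remaining_ids()

# Step 2: within each pair, drop everything except the highest id.
conn.execute(f"""
    DELETE FROM cars_drivers
    WHERE EXISTS (SELECT 1 FROM cars_drivers AS c2
                  WHERE {SAME_PAIR} AND c2.id > cars_drivers.id)
""")
after_step2 = remaining_ids()

# With one row per pair left, a unique key on the pair can now be created.
conn.execute(
    "CREATE UNIQUE INDEX ux_driver_car ON cars_drivers (driver_id, car_id)"
)
print(after_step1, after_step2)
```

After step 1 the table holds ids 1, 2, 3, 4, 8, 9 and 11, and after step 2 it holds 2, 4, 9 and 11, matching the tables above. The unique index at the end succeeds only if no duplicate pairs remain, which makes it a cheap final check.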
