SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery
Problem description
In an SQLite database I have one table where I need to count the duplicates across certain columns (i.e. rows where 3 particular columns are the same) and then also number each of these cases (i.e. if there are 2 occurrences of a particular duplicate, they need to be numbered as 1 and 2). I'm finding it a bit difficult to explain in words so I'll use a simplified example below.
The data I have is similar to the following (the first line is the header row; the table is referred to below as "idcountdata"):
id  match1  match2  match3  data
1   AbCde   BC      0       data01
2   AbCde   BC      0       data02
3   AbCde   BC      1       data03
4   AbCde   AB      0       data04
5   FGhiJ   BC      0       data05
6   FGhiJ   AB      0       data06
7   FGhiJ   BC      1       data07
8   FGhiJ   BC      1       data08
9   FGhiJ   BC      2       data09
10  HkLMop  BC      1       data10
11  HkLMop  BC      1       data11
12  HkLMop  BC      1       data12
13  HkLMop  DE      1       data13
14  HkLMop  DE      2       data14
15  HkLMop  DE      2       data15
16  HkLMop  DE      2       data16
17  HkLMop  DE      2       data17
And the output I need to generate for the above would be:
id  match1  match2  match3  data    matchid  matchcount
1   AbCde   BC      0       data01  1        2
2   AbCde   BC      0       data02  2        2
3   AbCde   BC      1       data03  1        1
4   AbCde   AB      0       data04  1        1
5   FGhiJ   BC      0       data05  1        1
6   FGhiJ   AB      0       data06  1        1
7   FGhiJ   BC      1       data07  1        2
8   FGhiJ   BC      1       data08  2        2
9   FGhiJ   BC      2       data09  1        1
10  HkLMop  BC      1       data10  1        3
11  HkLMop  BC      1       data11  2        3
12  HkLMop  BC      1       data12  3        3
13  HkLMop  DE      1       data13  1        1
14  HkLMop  DE      2       data14  1        4
15  HkLMop  DE      2       data15  2        4
16  HkLMop  DE      2       data16  3        4
17  HkLMop  DE      2       data17  4        4
Previously I was using a couple of correlated subqueries to achieve this as follows:
SELECT id, match1, match2, match3, data,
       (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
           AND d2.id<=d1.id)
       AS matchid,
       (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3)
       AS matchcount
  FROM idcountdata d1;
But the table has over 200,000 rows (and the data can be variable in length/content) and hence this takes hours to run. (Strangely, when I first used the same query on the same data back in mid-to-late 2013 it took minutes rather than hours, but that is beside the point - even back then I thought it was inelegant and inefficient.)
I've already converted the correlated subquery for "matchcount" in the above to an uncorrelated subquery with a JOIN as follows:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data,
       matchcount
  FROM idcountdata d1
  JOIN
       (SELECT id, match1, match2, match3, count(*) matchcount
          FROM idcountdata
         GROUP BY match1, match2, match3) d2
    ON (d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3);
So it's just the subquery for "matchid" that I would like some help to optimise.
In short, the following query runs too slowly for larger datasets:
SELECT id, match1, match2, match3, data,
       (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
           AND d2.id<=d1.id)
       matchid
  FROM idcountdata d1;
How can I improve the performance of the above query?
It doesn't have to run in seconds, but it needs to be minutes rather than hours (for around 200,000 rows).
A self-join may be faster than a correlated subquery:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, count(*) matchid
  FROM idcountdata d1
  JOIN idcountdata d2 ON d1.match1 = d2.match1
                     AND d1.match2 = d2.match2
                     AND d1.match3 = d2.match3
                     AND d1.id >= d2.id
 GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data;
This query can take advantage of a composite index on (match1, match2, match3, id).
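As a rough sketch of how the index and the self-join might fit together, the script below loads the sample rows from the question into an in-memory SQLite database, creates the composite index, and computes both matchid and matchcount with joins. Python's sqlite3 module is used only as a convenient way to run the SQL. The column types, the index name idx_idcountdata_match, the subquery alias totals, and the helper number_duplicates are assumptions made for illustration and do not come from the original post.

```python
import sqlite3

# Sample rows copied from the question: (id, match1, match2, match3, data).
ROWS = [
    (1, "AbCde", "BC", 0, "data01"), (2, "AbCde", "BC", 0, "data02"),
    (3, "AbCde", "BC", 1, "data03"), (4, "AbCde", "AB", 0, "data04"),
    (5, "FGhiJ", "BC", 0, "data05"), (6, "FGhiJ", "AB", 0, "data06"),
    (7, "FGhiJ", "BC", 1, "data07"), (8, "FGhiJ", "BC", 1, "data08"),
    (9, "FGhiJ", "BC", 2, "data09"), (10, "HkLMop", "BC", 1, "data10"),
    (11, "HkLMop", "BC", 1, "data11"), (12, "HkLMop", "BC", 1, "data12"),
    (13, "HkLMop", "DE", 1, "data13"), (14, "HkLMop", "DE", 2, "data14"),
    (15, "HkLMop", "DE", 2, "data15"), (16, "HkLMop", "DE", 2, "data16"),
    (17, "HkLMop", "DE", 2, "data17"),
]

# Self-join for matchid (rows in the same group with id <= this row's id),
# joined to a grouped subquery that supplies matchcount (size of the group).
QUERY = """
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data,
       count(*)          AS matchid,
       totals.matchcount AS matchcount
  FROM idcountdata d1
  JOIN idcountdata d2
    ON d1.match1 = d2.match1
   AND d1.match2 = d2.match2
   AND d1.match3 = d2.match3
   AND d1.id >= d2.id
  JOIN (SELECT match1, match2, match3, count(*) AS matchcount
          FROM idcountdata
         GROUP BY match1, match2, match3) AS totals
    ON d1.match1 = totals.match1
   AND d1.match2 = totals.match2
   AND d1.match3 = totals.match3
 GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data, totals.matchcount
 ORDER BY d1.id
"""


def number_duplicates(rows):
    """Hypothetical helper: build the table, index it, and run QUERY."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the question does not show the schema.
    conn.execute(
        "CREATE TABLE idcountdata ("
        "id INTEGER PRIMARY KEY, match1 TEXT, match2 TEXT, "
        "match3 INTEGER, data TEXT)"
    )
    conn.executemany("INSERT INTO idcountdata VALUES (?, ?, ?, ?, ?)", rows)
    # Composite index suggested in the answer (the index name is made up).
    conn.execute(
        "CREATE INDEX idx_idcountdata_match "
        "ON idcountdata (match1, match2, match3, id)"
    )
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = number_duplicates(ROWS)
for row in result:
    print(row)
```

On a real table it is probably worth prefixing the query with EXPLAIN QUERY PLAN to check whether SQLite actually picks the index for the self-join, since the planner's choice can depend on the data and the SQLite version.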