大型数据集上sum(column_name),sum(1)和count(*)之间的巨大性能差异 [英] Huge performance differences between sum(column_name), sum(1) and count(*) on a large dataset
问题描述
编辑:
由于你们建议为球员/锦标赛名称创建单独的表并用外键替换字符串,所以我做了以下操作:
Since you guys suggested creating separate tables for player/tournament names and replacing strings with foreign keys I did the following:
SELECT DISTINCT tournament INTO tournaments FROM chess_data2
ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY
我重复了namew和nameb的操作,然后用外键替换了字符串。这是棘手的地方-我无法在合法时间内做到这一点。
I repeated that for namew and nameb, then went about replacing strings with foreign keys. Here is where it got tricky - I am not able to do it in a "legit" time.
我尝试了以下两种方法:
1)删除现有索引
1)为namew创建单个索引,名称b和锦标赛分别
1)运行查询,将我想要的数据插入到新表中:
I tried two approaches was the following:
1) Delete existing indexes
1) Create individual indexes for namew, nameb and tournament separately
1) Run the query inserting the data I want into a new table:
SELECT date, whiterank, blackrank, t_round, result,
(SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep,
(SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp,
(SELECT t.id FROM tournaments t WHERE t_d2.tournament = t.t_name) AS tournament
INTO final_chess from chess_data2 c_d2
很不幸,它非常慢,所以我回到了用户Boris Shchegolev的身边。在评论中,他建议在现有表Chess_data2中创建一个新列并进行更新。所以我做到了:
Unfortunately it was very slow, so I came back to user Boris Shchegolev. In comment, he suggested creating a new column in the existing table chess_data2 and updating. So I did it:
ALTER TABLE chess_data2 ADD COLUMN name_id INTEGER
UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew)"
半个小时前,我开始了这些查询,第一个是即时的,但第二个是永远的。
I started those queries half an hour ago, the first one was instant but the second one takes forever.
我现在该怎么办?
初始问题:
数据库架构:
日期DATE
namew TEXT
nameb TEXT
whiterank INTEGER
blackrank INTEGER
比赛TEXT
t_round INTEGER < br>
结果REAL
id BIGINT
Chess_data2_pkey(id)
black_index(名称b,比赛,日期)
Chess_data2_pkey(id)唯一
w_b_t_d_index(namew,nameb,锦标赛,日期)
white_index(namew,锦标赛,日期)
Database schema:
date DATE
namew TEXT
nameb TEXT
whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
result REAL
id BIGINT
chess_data2_pkey(id)
black_index (nameb, tournament, date)
chess_data2_pkey(id) UNIQUE
w_b_t_d_index (namew, nameb, tournament, date)
white_index (namew, tournament, date)
问题:
以下统计信息的性能水泥非常好(〜60-70秒。在具有3百万个条目的数据库中):
Problem:
The performance of the following statement is very good (~60-70 sec. in a database with 3 mln entries):
# Number of points that the white player has so far accrued throughout the tournament
(SELECT coalesce(SUM(result),0) from chess_data2 t2
where (t1.namew = t2.namew) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90)
+ SELECT coalesce(SUM(1-result),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90 ) AS result_in_t_w
from chessdata2 t1
同时,以下选择(
# Number of games that the white player has so far played in the tournament
(SELECT coalesce(count(*),0) from chess_data t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(count(*),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1
我尝试了另一种方法(加和),但也没有更好的结果:
I tried a different approach (with sum) and it didn't go better either:
# Number of games that the white player has so far played in the tournament
(SELECT coalesce(sum(1),0) from chess_data t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(sum(1),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1
您知道这里发生了什么以及如何解决吗?我正在PyCharm中使用python 3.5和psycopg2来运行这些查询。我将很高兴提供任何其他信息,因为这对我来说是一个非常重要的项目。
EXPLAIN ANALYZE (用于最后一个查询):
Any idea what's going on here and how to fix that? I'm using python 3.5 and psycopg2 in PyCharm to run those queries. I will be very happy to provide any additional information as it is a very important project for me.
EXPLAIN ANALYZE (Used for the last query):
Seq Scan on chess_data2 t1 (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1)
Planning time: 0.161 ms
Execution time: 81883.716 ms
SubPlan 2
SubPlan 1
-> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
-> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
-> Index Only Scan using white_index on chess_data2 t2 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185)
-> Index Only Scan using black_index on chess_data2 t2_1 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185)
Rows Removed by Filter: 1
Rows Removed by Filter: 1
Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
Heap Fetches: 6009767
Heap Fetches: 5303160
Filter: (t1.date < (date + 90))
Filter: (t1.date < (date + 90))
推荐答案
由于表设计不佳,查询的性能不佳。从EXPLAIN可以明显看出,数据库使用索引,但索引字段都是 TEXT
,索引很大。
The queries are performing poorly due to poor table design. From the EXPLAIN it is obvious that the database uses indexes, but the the indexed fields are all TEXT
and the indexes are huge.
要修复它:
- 创建表
名称
- 用
namew_id $ c $替换
namew
和nameb
c>和nameb_id
,都引用名称
- 创建表
锦标赛
- 将
锦标赛
替换为tournament_id
引用比赛
- 重新索引
black_index
为(nameb_id,锦标赛ID,日期)
- 重新索引
white_index
作为(namew_id,比赛ID,日期)
- 放下
w_b_t_d_index
除非您在其他查询中使用 - 从
count(*)
查询 $ b $中删除无用的
coalesce
b - create table
names
- replace
namew
andnameb
withnamew_id
andnameb_id
, both referencingnames
- create table
tournaments
- replace
tournament
withtournament_id
referencingtournaments
- reindex
black_index
as(nameb_id, tournament_id, date)
- reindex
white_index
as(namew_id, tournament_id, date)
- drop
w_b_t_d_index
unless you use it in some other query - remove the useless
coalesce
from thecount(*)
query
您的查询应如下所示:
SELECT
(
SELECT count(*)
FROM chess_data t2
WHERE
t1.namew_id = t2.namew_id AND
t1.tournament_id = t2.tournament_id AND
t1.date > t2.date AND
t1.date < t2.date + 90
)
+
(
SELECT count(*)
FROM chess_data2 t2
WHERE
t1.namew_id = t2.nameb_id AND
t1.tournament_id = t2.tournament_id AND
t1.date > t2.date AND
t1.date < t2.date + 90
) AS games_t_w
FROM chess_data2 t1
这篇关于大型数据集上sum(column_name),sum(1)和count(*)之间的巨大性能差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!