大型数据集上sum(column_name),sum(1)和count(*)之间的巨大性能差异 [英] Huge performance differences between sum(column_name), sum(1) and count(*) on a large dataset

查看:110
本文介绍了大型数据集上sum(column_name),sum(1)和count(*)之间的巨大性能差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编辑:

由于你们建议为球员/锦标赛名称创建单独的表并用外键替换字符串,所以我做了以下操作:


Since you guys suggested creating separate tables for player/tournament names and replacing strings with foreign keys I did the following:

SELECT DISTINCT tournament INTO tournaments FROM chess_data2
ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY

我重复了namew和nameb的操作,然后用外键替换了字符串。这是棘手的地方-我无法在合法时间内做到这一点。

I repeated that for namew and nameb, then went about replacing strings with foreign keys. Here is where it got tricky - I am not able to do it in a "legit" time.

我尝试了以下两种方法:

1)删除现有索引

1)为namew创建单个索引,名称b和锦标赛分别

1)运行查询,将我想要的数据插入到新表中:

I tried two approaches was the following:
1) Delete existing indexes
1) Create individual indexes for namew, nameb and tournament separately
1) Run the query inserting the data I want into a new table:

SELECT date, whiterank, blackrank, t_round, result,
(SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep,
(SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp,
(SELECT t.id FROM tournaments t WHERE t_d2.tournament = t.t_name) AS tournament
INTO final_chess from chess_data2 c_d2

很不幸,它非常慢,所以我回到了用户Boris Shchegolev的身边。在评论中,他建议在现有表Chess_data2中创建一个新列并进行更新。所以我做到了:

Unfortunately it was very slow, so I came back to user Boris Shchegolev. In comment, he suggested creating a new column in the existing table chess_data2 and updating. So I did it:

ALTER TABLE chess_data2 ADD COLUMN name_id INTEGER
UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew)"

半个小时前,我开始了这些查询,第一个是即时的,但第二个是永远的。

I started those queries half an hour ago, the first one was instant but the second one takes forever.

我现在该怎么办?

初始问题:

数据库架构:

日期DATE

namew TEXT

nameb TEXT
whiterank INTEGER

blackrank INTEGER

比赛TEXT

t_round INTEGER < br>
结果REAL

id BIGINT

Chess_data2_pkey(id)

black_index(名称b,比赛,日期)

Chess_data2_pkey(id)唯一

w_b_t_d_index(namew,nameb,锦标赛,日期)

white_index(namew,锦标赛,日期)

Database schema:
date DATE
namew TEXT
nameb TEXT whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
result REAL
id BIGINT
chess_data2_pkey(id)
black_index (nameb, tournament, date)
chess_data2_pkey(id) UNIQUE
w_b_t_d_index (namew, nameb, tournament, date)
white_index (namew, tournament, date)

问题:

以下统计信息的性能水泥非常好(〜60-70秒。在具有3百万个条目的数据库中):

Problem:
The performance of the following statement is very good (~60-70 sec. in a database with 3 mln entries):

# Number of points that the white player has so far accrued throughout the tournament
(SELECT coalesce(SUM(result),0) from chess_data2 t2
where (t1.namew = t2.namew) and t1.tournament = t2.tournament
and t1.date > t2.date  and t1.date < t2.date + 90)
+ SELECT coalesce(SUM(1-result),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90 ) AS result_in_t_w
from chessdata2 t1

同时,以下选择(

# Number of games that the white player has so far played in the tournament
(SELECT coalesce(count(*),0) from chess_data t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(count(*),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1

我尝试了另一种方法(加和),但也没有更好的结果:

I tried a different approach (with sum) and it didn't go better either:

# Number of games that the white player has so far played in the tournament
(SELECT coalesce(sum(1),0) from chess_data t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(sum(1),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1

您知道这里发生了什么以及如何解决吗?我正在PyCharm中使用python 3.5和psycopg2来运行这些查询。我将很高兴提供任何其他信息,因为这对我来说是一个非常重要的项目。

EXPLAIN ANALYZE (用于最后一个查询):

Any idea what's going on here and how to fix that? I'm using python 3.5 and psycopg2 in PyCharm to run those queries. I will be very happy to provide any additional information as it is a very important project for me.
EXPLAIN ANALYZE (Used for the last query):

Seq Scan on chess_data2 t1  (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1)
Planning time: 0.161 ms
Execution time: 81883.716 ms
SubPlan 2
SubPlan 1
->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
      ->  Index Only Scan using white_index on chess_data2 t2  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185)
      ->  Index Only Scan using black_index on chess_data2 t2_1  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185)
            Rows Removed by Filter: 1
            Rows Removed by Filter: 1
            Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
            Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
            Heap Fetches: 6009767
            Heap Fetches: 5303160
            Filter: (t1.date < (date + 90))
            Filter: (t1.date < (date + 90))


推荐答案

由于表设计不佳,查询的性能不佳。从EXPLAIN可以明显看出,数据库使用索引,但索引字段都是 TEXT ,索引很大。

The queries are performing poorly due to poor table design. From the EXPLAIN it is obvious that the database uses indexes, but the the indexed fields are all TEXT and the indexes are huge.

要修复它:


  • 创建表名称

  • namew_id namew nameb c>和 nameb_id ,都引用名称

  • 创建表锦标赛

  • 锦标赛替换为 tournament_id 引用比赛

  • 重新索引 black_index (nameb_id,锦标赛ID,日期)

  • 重新索引 white_index 作为(namew_id,比赛ID,日期)

  • 放下 w_b_t_d_index 除非您在其他查询中使用

  • count(*)查询
  • $ b $中删除无用的 coalesce b
  • create table names
  • replace namew and nameb with namew_id and nameb_id, both referencing names
  • create table tournaments
  • replace tournament with tournament_id referencing tournaments
  • reindex black_index as (nameb_id, tournament_id, date)
  • reindex white_index as (namew_id, tournament_id, date)
  • drop w_b_t_d_index unless you use it in some other query
  • remove the useless coalesce from the count(*) query

您的查询应如下所示:

SELECT
    (
        SELECT count(*)
        FROM chess_data t2 
        WHERE
            t1.namew_id = t2.namew_id AND
            t1.tournament_id = t2.tournament_id AND
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    )
    +
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE 
            t1.namew_id = t2.nameb_id AND
            t1.tournament_id = t2.tournament_id AND 
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    ) AS games_t_w
FROM chess_data2 t1

这篇关于大型数据集上sum(column_name),sum(1)和count(*)之间的巨大性能差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆