How to speed up the query in PostgreSQL

Question

I have a PostgreSQL database with a lot of data (currently around 46 GB, and it will keep growing). I created indexes on frequently used columns and adapted the config file:

shared_buffers = 1GB
temp_buffers = 256MB
work_mem = 512MB

But this query is still slow:

select distinct us_category_id as cat, count(h_user_id) as res from web_hits 
inner join users on h_user_id = us_id 
where (h_datetime)::date = ('2015-06-26')::date and us_category_id != ''
group by us_category_id

EXPLAIN ANALYZE:

HashAggregate  (cost=2870958.72..2870958.93 rows=21 width=9) (actual time=899141.683..899141.683 rows=0 loops=1)
  Group Key: users.us_category_id, count(web_hits.h_user_id)
  ->  HashAggregate  (cost=2870958.41..2870958.62 rows=21 width=9) (actual time=899141.681..899141.681 rows=0 loops=1)
        Group Key: users.us_category_id
        ->  Hash Join  (cost=5974.98..2869632.11 rows=265259 width=9) (actual time=899141.679..899141.679 rows=0 loops=1)
              Hash Cond: ((web_hits.h_user_id)::text = (users.us_id)::text)
              ->  Seq Scan on web_hits  (cost=0.00..2857563.80 rows=275260 width=7) (actual time=899141.676..899141.676 rows=0 loops=1)
                    Filter: ((h_datetime)::date = '2015-06-26'::date)
                    Rows Removed by Filter: 55051918
              ->  Hash  (cost=4292.99..4292.99 rows=134559 width=10) (never executed)
                    ->  Seq Scan on users  (cost=0.00..4292.99 rows=134559 width=10) (never executed)
                          Filter: ((us_category_id)::text <> ''::text)
Planning time: 1.309 ms
Execution time: 899141.789 ms

The date in the filter changes between runs. How can I speed up the query?

Table and index creation

CREATE TABLE web_hits (
  h_id integer NOT NULL DEFAULT nextval('w_h_seq'::regclass),
  h_user_id character varying,
  h_datetime timestamp without time zone,
  h_db_id character varying,
  h_voc_prefix character varying,
  ...
  h_bot_chek integer, -- 1-бот...
  CONSTRAINT w_h_pk PRIMARY KEY (h_id)
);
ALTER TABLE web_hits OWNER TO postgres;
COMMENT ON COLUMN web_hits.h_bot_chek IS '1-бот, 0-не бот';

CREATE INDEX h_datetime ON web_hits (h_datetime);
CREATE INDEX h_db_index ON web_hits (h_db_id COLLATE pg_catalog."default");
CREATE INDEX h_pref_index ON web_hits (h_voc_prefix COLLATE pg_catalog."default" text_pattern_ops);
CREATE INDEX h_user_index ON web_hits (h_user_id text_pattern_ops);

CREATE TABLE users (
  us_id character varying NOT NULL,
  us_category_id character varying,
  ...
  CONSTRAINT user_pk PRIMARY KEY (us_id),
  CONSTRAINT cities_users_fk FOREIGN KEY (us_city_home)
      REFERENCES cities (city_id),
  CONSTRAINT countries_users_fk FOREIGN KEY (us_country_home)
      REFERENCES countries (country_id),
  CONSTRAINT organizations_users_fk FOREIGN KEY (us_institution_id)
      REFERENCES organizations (org_id),
  CONSTRAINT specialities_users_fk FOREIGN KEY (us_speciality_id)
      REFERENCES specialities (speciality_id),
  CONSTRAINT us_affiliation FOREIGN KEY (us_org_id)
      REFERENCES organizations (org_id),
  CONSTRAINT us_category FOREIGN KEY (us_category_id)
      REFERENCES categories (cat_id),
  CONSTRAINT us_reading_room FOREIGN KEY (us_reading_room_id)
      REFERENCES reading_rooms (rr_id)
);
ALTER TABLE users OWNER TO sveta;
COMMENT ON COLUMN users.us_type IS '0-аноним, 1-читатель, 2-удаленный';

CREATE INDEX us_cat_index ON users (us_category_id);
CREATE INDEX us_user_index ON users (us_id text_pattern_ops);

Solution

Essential information is missing in the question, so I base parts of my answer on educated guesses: web_hits.h_user_id is sometimes NULL, as you added in a comment.

Query

Basically, the query can be simplified / improved to this in any case:

SELECT u.us_category_id AS cat, count(*) AS res
FROM   users    u
JOIN   web_hits w ON w.h_user_id = u.us_id
WHERE  w.h_datetime >= '2015-06-26 0:0'::timestamp
AND    w.h_datetime <  '2015-06-27 0:0'::timestamp
AND    w.h_user_id IS NOT NULL  -- remove irrelevant rows, match index
AND    u.us_category_id <> ''
GROUP  BY 1;

  • DISTINCT is unnecessary and expensive, since you already GROUP BY us_category_id (as @Gordon already mentioned).

  • Make the conditions sargable, so that an index can be used: casting h_datetime to date for every row defeats a plain index on the timestamp column, while the equivalent half-open range condition on h_datetime itself can use one (see the first sketch after this list).

  • Since you have joined on the column w.h_user_id, it follows logically that resulting rows are NOT NULL in this column. count(*) is equivalent in this case and a bit faster.

  • The condition h_user_id IS NOT NULL seems redundant, since NULL is eliminated in the JOIN anyway, but it allows the use of the partial index with a matching condition (see below).

  • users.us_id (and consequently web_hits.h_user_id) should probably not have the data type varchar (character varying). That's an inefficient data type for a PK / FK column in a huge table. Use a numeric data type like int or bigint (or uuid if you must). Similar considerations apply to us_category_id: it should be integer or related (see the second sketch after this list).

  • The standard SQL inequality operator is <>. Use that instead of the also supported !=.

  • Use table qualification to avoid ambiguities - and in any case to make your query clear to the readers in a public forum.
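As an illustration of the sargability point above, here is a minimal way to compare the two predicate forms with EXPLAIN; the exact plans depend on your data and settings:

-- Not sargable: the cast is applied to every row, so a plain index
-- on h_datetime cannot be used.
EXPLAIN
SELECT count(*) FROM web_hits
WHERE  (h_datetime)::date = '2015-06-26'::date;

-- Sargable: a half-open range on the column itself can use the index.
EXPLAIN
SELECT count(*) FROM web_hits
WHERE  h_datetime >= '2015-06-26 0:0'::timestamp
AND    h_datetime <  '2015-06-27 0:0'::timestamp;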
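And to make the data type point concrete, a minimal sketch of the two tables with numeric keys. The *_new names and the reduced column lists are assumptions for illustration, not the actual schema:

-- Hypothetical redesign with numeric keys (other columns omitted):
CREATE TABLE users_new (
  us_id          bigint PRIMARY KEY,
  us_category_id integer  -- FK to categories once cat_id is integer, too
);

CREATE TABLE web_hits_new (
  h_id       bigserial PRIMARY KEY,
  h_user_id  bigint REFERENCES users_new (us_id),
  h_datetime timestamp
);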

Optimize

Further assuming that:

  • users.us_category_id <> '' is true for most rows.
  • Most or all rows with web_hits.h_user_id IS NOT NULL are counted.

Then this will be faster still:

SELECT u.us_category_id AS cat, sum(ct) AS res
FROM   users u
JOIN  (
   SELECT h_user_id, count(*) AS ct
   FROM   web_hits
   WHERE  h_datetime >= '2015-06-26 0:0'::timestamp
   AND    h_datetime <  '2015-06-27 0:0'::timestamp
   AND    h_user_id IS NOT NULL  -- remove irrelevant rows, match index
   GROUP  BY 1
   ) w ON w.h_user_id = u.us_id
AND    u.us_category_id <> ''
GROUP  BY 1;

Indexes

Either way, partial indexes are best for your case:

1.

CREATE INDEX wh_usid_datetime_idx ON web_hits(h_user_id, h_datetime)
WHERE  h_user_id IS NOT NULL;

This eliminates rows where web_hits.h_user_id IS NULL from the index.

Columns in that order, not the other way round as has been suggested elsewhere: the join condition is an equality check on h_user_id, while h_datetime is filtered with a range, and the rule of thumb for multicolumn indexes is equality columns first, range columns later.
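With that index in place, a quick way to check that the inner aggregation can use it (whether you get an index-only scan also depends on the visibility map and current statistics):

EXPLAIN
SELECT h_user_id, count(*)
FROM   web_hits
WHERE  h_datetime >= '2015-06-26 0:0'::timestamp
AND    h_datetime <  '2015-06-27 0:0'::timestamp
AND    h_user_id IS NOT NULL
GROUP  BY 1;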

2.

CREATE INDEX us_usid_cat_not_empty_idx ON users(us_id)
WHERE  us_category_id <> '';

This will be considerably smaller, because we don't store the potentially lengthy varchar column us_category_id in the index - we don't need its value for this case anyway, only the fact that it's <> ''. If you had integer columns, this consideration would not apply.

It also excludes rows with '' or NULL in us_category_id, making the index smaller still.
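If you want to verify the size difference, a simple check comparing the existing full index with the new partial one (index names as defined above):

SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM   pg_class
WHERE  relname IN ('us_cat_index', 'us_usid_cat_not_empty_idx');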

You have to weigh the maintenance cost of special indexes against their benefits. If you run queries with a matching condition a lot, they will pay off; otherwise they might not, and more general indexes might be better overall.


Of course, all the usual advice on performance optimization applies, too.
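For example, keeping the planner statistics current is cheap and sometimes decisive. A minimal sketch:

VACUUM ANALYZE web_hits;
VACUUM ANALYZE users;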

Frankly, there is not very much right about your query and many items are suspicious in your setup. Dealing with huge tables like you obviously are, you might consider professional help.
