Count on join of big tables with conditions is slow

Problem description

This query had reasonable times when the table was small. I'm trying to identify what's the bottleneck, but I'm not sure how to analyze the EXPLAIN results.

SELECT
  COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
  (size > 0) AND
  total_sales.customer_id IN (
    SELECT customers.id FROM customers WHERE customers.active = 't'
    AND customers.visible = 't' AND customers.organization_id = 3
  ) AND
  total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
  ) AND
  total_sales.period_id = 193;

I've tried both the approach of INNER JOIN'ing the customers and product_categories tables and doing an INNER SELECT. Both took the same time.

Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr

Postgres version:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Tables and indexes:

CREATE TABLE total_sales (
  id serial NOT NULL,
  value double precision,
  start_date date,
  end_date date,
  product_category_customer_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  processed boolean,
  customer_id integer,
  product_category_id integer,
  period_id integer,
  CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);


CREATE TABLE performance_analyses (
  id serial NOT NULL,
  total_sales_id integer,
  status_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  size double precision,
  period_size integer,
  nominal_variation double precision,
  percentual_variation double precision,
  relative_performance double precision,
  time_ago_max integer,
  deseasonalized_series text,
  significance character varying,
  relevance character varying,
  original_variation double precision,
  last_level double precision,
  quantiles text,
  range text,
  analysis_method character varying,
  CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);


CREATE TABLE product_categories (
  id serial NOT NULL,
  name character varying,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  external_id character varying,
  CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);


CREATE TABLE customers (
  id serial NOT NULL,
  name character varying,
  external_id character varying,
  region_id integer,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  active boolean DEFAULT false,
  visible boolean DEFAULT false,
  segment_id integer,
  "group" boolean,
  group_id integer,
  ticket_enabled boolean DEFAULT true,
  CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);    
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);

Row counts:

  • customers - 6,970 rows
  • product_categories - 34 rows
  • performance_analyses - 1,012,346 rows
  • total_sales - 7,104,441 rows

Recommended answer

Your query, rewritten and 100% equivalent:

SELECT count(*)
FROM   product_categories   pc 
JOIN   customers            c  USING (organization_id) 
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

Another answer advises to move all conditions into join clauses and order tables in the FROM list. This may apply for a certain other RDBMS with a comparatively primitive query planner. But while it doesn't hurt for Postgres either, it also has no effect on performance for your query - assuming default server configuration. The manual:

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.

Bold emphasis mine. There is more, read the manual.

The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)
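
The effect of join_collapse_limit can be observed directly. A minimal sketch for experimenting in a psql session (for diagnosis only, not for production):

```sql
-- Show the current planner setting (default is 8):
SHOW join_collapse_limit;

-- Setting it to 1 forces Postgres to join tables in the exact order
-- they are written in the FROM list, which lets you compare plans:
SET join_collapse_limit = 1;
-- run EXPLAIN on your query here, then restore the default:
RESET join_collapse_limit;
```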

The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.

Related:

Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
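
The difference can be illustrated with a minimal sketch (hypothetical throwaway tables, not from the schema above):

```sql
CREATE TEMP TABLE p (id int);
CREATE TEMP TABLE c (p_id int);
INSERT INTO p VALUES (1);
INSERT INTO c VALUES (1), (1);   -- two rows match p.id = 1

-- IN (<subquery>) never multiplies rows on the left side: count is 1
SELECT count(*) FROM p WHERE id IN (SELECT p_id FROM c);

-- A join multiplies rows for duplicate matches: count is 2
SELECT count(*) FROM p JOIN c ON c.p_id = p.id;
```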

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense. But your query uses 4,988 of them according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:

CREATE INDEX index_customers_on_org_id_active_visible  -- renamed: index_customers_on_organization_id already exists
ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

CREATE INDEX ts_pid_pcid_cid_id  -- renamed: the original name clashes with an existing index
ON total_sales (period_id, product_category_id, customer_id, id);
Again, aiming for an index-only scan. This is the most important one.
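
To check whether the new index is used as hoped, inspect the plan: look for an Index Only Scan node and a low "Heap Fetches" count (the table must be freshly vacuumed for the visibility map to allow index-only scans). A sketch, with hypothetical filter values for illustration:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   total_sales
WHERE  period_id = 193
AND    product_category_id = 1   -- hypothetical value
AND    customer_id = 1;          -- hypothetical value
```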

You can delete the completely redundant index index_total_sales_on_product_category_id.

performance_analyses has 1,012,346 rows. Indexes are very important. I suggest another partial index with the condition size > 0:

CREATE INDEX index_performance_analyses_on_total_sales_id_size  -- renamed: the original name clashes with an existing index
ON performance_analyses (total_sales_id)
WHERE size > 0;

But:

"Rows Removed by Filter: 0"

It seems like this condition serves no purpose. Are there no rows for which size > 0 is false?

After creating these indexes you need to ANALYZE the tables.

Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your settings for autovacuum and probably also per-table settings for your two big tables performance_analyses and total_sales.
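
Per-table autovacuum settings are applied as storage parameters. A sketch of what more aggressive settings for the two big tables might look like; the scale factors below are assumptions and need tuning for your workload:

```sql
-- Analyze after ~1% of rows change and vacuum after ~2%
-- (the global defaults are 10% and 20% respectively):
ALTER TABLE total_sales SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);
ALTER TABLE performance_analyses SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);
```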

According to your comment, you already ran VACUUM and ANALYZE, and it made the query slower. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Else try pg_repack.
With all the fishy statistics and bad plans I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's not good to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!
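
As concrete commands, assuming the database is called yourdb as above (pg_repack must be installed separately as an extension):

```shell
# VACUUM FULL the two big tables once (takes an ACCESS EXCLUSIVE lock each):
psql -d yourdb -c 'VACUUM FULL ANALYZE performance_analyses;'
psql -d yourdb -c 'VACUUM FULL ANALYZE total_sales;'

# Or rewrite the whole database (expensive, long locks; not for regular use):
vacuumdb --full --analyze yourdb

# Alternative that avoids long exclusive locks:
pg_repack --dbname=yourdb --table=performance_analyses --table=total_sales
```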

While being at it, have a look at the cost settings of your DB as well. Related:
