Postgres not using index when index scan is much better option

Question

I have a simple query to join two tables that's being really slow. I found out that the query plan does a seq scan on the large table email_activities (~10M rows), while I think using indexes with nested loops would actually be faster.

I rewrote the query using a subquery in an attempt to force the use of the index, and noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k rows, the plan does use the index on email_activities, while setting the limit to even 44k causes the plan to use a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.

What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms

And:

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms




  • Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

  • email_activities: ~10M rows

  • email_recipients: ~11M rows

  • Answer


    Index scan -> bitmap index scan -> sequential scan

    For few rows it pays to run an index scan. With more rows to return (a higher percentage of the table, and depending on data distribution, value frequencies and row width) it becomes more likely to find several rows on one data page. Then it pays to switch to a bitmap index scan. Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.

    Postgres switches to a sequential scan, expecting to find rows=263962, which is already 3% of the whole table. (While only rows=47935 are actually found; see below.)

    More in this related answer:

    • Efficient PostgreSQL query on timestamp using index or bitmap index scan?

    You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.

    SET enable_seqscan = off (like suggested in another answer) does that to sequential scans. But that's intended for debugging purposes in your session only. Do not use this as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. Quoting the manual:


    These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 18.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.

    That's already most of the advice you need.

    • Keep PostgreSQL from sometimes choosing a bad query plan
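    For pure debugging of the plan choice in a session, the usual pattern is to disable the access method, re-run EXPLAIN, and reset the setting immediately afterwards. A sketch using the second query from the question:

    SET enable_seqscan = off;  -- session-local, debugging only

    EXPLAIN ANALYZE
    SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
    FROM   "email_activities"
    WHERE  email_recipient_id IN (
       SELECT "email_recipients"."id"
       FROM   email_recipients
       WHERE  "email_recipients"."email_campaign_id" = 1607
       LIMIT  50000);

    RESET enable_seqscan;  -- restore the default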

    In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:


    estimated rows=227007 vs. actual rows=40789

    estimated rows=263962 vs. actual rows=47935

    If you run this query often it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:

    ALTER TABLE email_activities ALTER COLUMN email_recipient_id
    SET STATISTICS 3000;  -- max 10000, default 100
    

    Then: ANALYZE email_activities;
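    To verify that the bigger sample took effect, you can inspect the column's statistics in the standard pg_stats view (a sketch; table and column names are the ones from the question):

    SELECT attname, n_distinct, null_frac
    FROM   pg_stats
    WHERE  tablename = 'email_activities'
    AND    attname   = 'email_recipient_id';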

    In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like:

    CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
      RETURNS bigint AS
    $func$
       SELECT COUNT(DISTINCT a.email_recipient_id)
       FROM   email_activities a
       WHERE  a.email_recipient_id IN (
          SELECT id
          FROM   email_recipients
          WHERE  email_campaign_id = $1
          LIMIT  $2)       -- or consider query below
    $func$  LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;

    The setting only applies to the local scope of the function.
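    A call with the campaign id and limit from the question would then look like:

    SELECT f_count_dist_recipients(1607, 43000);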

    Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now, may become a very bad idea later.

    And typically this is just a workaround for a problem with your setup. Better find and fix it.

    Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.

    SELECT COUNT(*) AS ct
    FROM  (
       SELECT id
       FROM   email_recipients
       WHERE  email_campaign_id = 1607
       LIMIT  43000
       ) r
    WHERE  EXISTS (
       SELECT 1
       FROM   email_activities
       WHERE  email_recipient_id = r.id);
    

