Get paginated rows and total count in single query

Question

Core requirement:
Find latest entries for a person_id by submission_date for specified filter criteria type, plan, status. There could be more such filters, but the logic to return latest by submission date is the same regardless. Two major uses one for paginated viewing in UI and second for generating reports.

-- My attempt (pseudo-SQL; the final part is not valid syntax):
WITH cte AS (
  SELECT * FROM (
    SELECT my_table.*, rank() OVER (PARTITION BY person_id ORDER BY submission_date DESC, last_updated DESC, id DESC)
    FROM my_table
    ) rank_filter
  WHERE rank = 1 AND status IN ('ACCEPTED', 'CORRECTED') AND type != 'CR' AND h_plan_id IN (10000, 20000)
)
SELECT count(id) FROM cte GROUP BY id,  -- somehow combined with:
SELECT * FROM cte LIMIT 10 OFFSET 0;

The GROUP BY also does not work on a CTE. A UNION with all NULLs in the count query might work for combining the two, but I'm not sure.

The main reason I want to combine these two into one query is that the table is big and the window function is expensive. Currently I use separate queries, which basically run the same expensive scan twice.

Postgres version 12.

\d my_table;
                               Table "public.my_table"
                 Column   |            Type             | Collation | Nullable 
--------------------------+-----------------------------+-----------+----------
 id                       | bigint                      |           | not null 
 h_plan_id                | bigint                      |           | not null 
 h_plan_submitter_id      | bigint                      |           |          
 last_updated             | timestamp without time zone |           |          
 date_created             | timestamp without time zone |           |          
 modified_by              | character varying(255)      |           |          
 segment_number           | integer                     |           |          

 -- <bunch of other text columns>

 submission_date          | character varying(255)      |           |          
 person_id                | character varying(255)      |           |          
 status                   | character varying(255)      |           |          
 file_id                  | bigint                      |           | not null 
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "my_table_file_idx" btree (file_id)
    "my_table_hplansubmitter_idx" btree (h_plan_submitter_id)
    "my_table_key_hash_idx" btree (key_hash)
    "my_table_person_id_idx" btree (person_id)
    "my_table_segment_number_idx" btree (segment_number)
Foreign-key constraints:
    "fk38njesaryvhj7e3p4thqkq7pb" FOREIGN KEY (h_plan_id) REFERENCES health_plan(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fk6by9668sowmdob7433mi3rpsu" FOREIGN KEY (h_plan_submitter_id) REFERENCES h_plan_submitter(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fkb06gpo9ng6eujkhnes0eco7bj" FOREIGN KEY (file_id) REFERENCES x12file(id) ON UPDATE CASCADE ON DELETE CASCADE

Additional information: possible values for type are EN and CR, with EN being about 70% of the data. Column widths (select avg_width from pg_stats where tablename='mytable';) total 374 across 41 columns, so about 9 bytes per column.

The idea is to show some pages up front to the user; they can then filter by additional parameters like file_name (each file usually has about 5k entries), type (very low cardinality), member_add_id (high cardinality), and plan_id (low cardinality; every 500k to a million entries will be associated with one plan id). The business requirement in all cases is to show just the latest record by submission_date for a certain set of plan ids (for reports it is done per year). The ORDER BY id was just defensive coding: the same day can have multiple entries, and even if someone edited the second-to-last entry (hence touching the last_updated timestamp), we want to show only the very last entry of the same data. This probably never happens and can be removed.

Users can use this data to generate CSV reports.

Result of EXPLAIN for the query with the RIGHT JOIN (from the answer below):

 Nested Loop Left Join  (cost=554076.32..554076.56 rows=10 width=17092) (actual time=4530.914..4530.922 rows=10 loops=1)
   CTE cte
     ->  Unique  (cost=519813.11..522319.10 rows=495358 width=1922) (actual time=2719.093..3523.029 rows=422638 loops=1)
           ->  Sort  (cost=519813.11..521066.10 rows=501198 width=1922) (actual time=2719.091..3301.622 rows=423211 loops=1)
                 Sort Key: mytable.person_id, mytable.submission_date DESC NULLS LAST, mytable.last_updated DESC NULLS LAST, mytable.id DESC
                 Sort Method: external merge  Disk: 152384kB
                 ->  Seq Scan on mytable  (cost=0.00..54367.63 rows=501198 width=1922) (actual time=293.953..468.554 rows=423211 loops=1)
                       Filter: (((status)::text = ANY ('{ACCEPTED,CORRECTED}'::text[])) AND (h_plan_id = ANY ('{1,2}'::bigint[])) AND ((type)::text <> 'CR'::text))
                       Rows Removed by Filter: 10158
   ->  Aggregate  (cost=11145.56..11145.57 rows=1 width=8) (actual time=4142.116..4142.116 rows=1 loops=1)
         ->  CTE Scan on cte  (cost=0.00..9907.16 rows=495358 width=0) (actual time=2719.095..4071.481 rows=422638 loops=1)
   ->  Limit  (cost=20611.67..20611.69 rows=10 width=17084) (actual time=388.777..388.781 rows=10 loops=1)
         ->  Sort  (cost=20611.67..21850.06 rows=495358 width=17084) (actual time=388.776..388.777 rows=10 loops=1)
               Sort Key: cte_1.person_id
               Sort Method: top-N heapsort  Memory: 30kB
               ->  CTE Scan on cte cte_1  (cost=0.00..9907.16 rows=495358 width=17084) (actual time=0.013..128.314 rows=422638 loops=1)
 Planning Time: 0.369 ms
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.947 ms, Inlining 4.983 ms, Optimization 178.469 ms, Emission 110.261 ms, Total 295.660 ms
 Execution Time: 4587.711 ms

Answer

First things first: you can use the results of a CTE multiple times in the same query; that is a main feature of CTEs. What you have would work like this (while still using the CTE only once):

WITH cte AS (
   SELECT * FROM (
      SELECT *, row_number()  -- see below
                OVER (PARTITION BY person_id
                      ORDER BY submission_date DESC NULLS LAST  -- see below
                             , last_updated DESC NULLS LAST  -- see below
                             , id DESC) AS rn
      FROM  tbl
      ) sub
   WHERE  rn = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   )
SELECT *, count(*) OVER () AS total_rows_in_cte
FROM   cte
LIMIT  10
OFFSET 0;  -- see below
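The combined page-plus-total pattern above can be sketched on a toy dataset. The snippet below uses SQLite via Python's stdlib as a stand-in for Postgres (both support these window functions); the table name and sample rows are made up, and row_number() is used in place of rank(), per the caveat that follows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, person_id TEXT, submission_date TEXT, status TEXT);
INSERT INTO tbl VALUES
  (1, 'p1', '2020-01-01', 'ACCEPTED'),
  (2, 'p1', '2020-02-01', 'CORRECTED'),   -- latest for p1
  (3, 'p2', '2020-01-15', 'ACCEPTED'),    -- latest for p2
  (4, 'p3', '2020-03-01', 'REJECTED');    -- excluded by the status filter
""")
rows = conn.execute("""
WITH cte AS (
   SELECT * FROM (
      SELECT *, row_number() OVER (PARTITION BY person_id
                                   ORDER BY submission_date DESC, id DESC) AS rn
      FROM tbl
      ) sub
   WHERE rn = 1
   AND   status IN ('ACCEPTED', 'CORRECTED')
   )
SELECT id, person_id, count(*) OVER () AS total_rows_in_cte
FROM   cte
ORDER  BY person_id
LIMIT  10 OFFSET 0
""").fetchall()
print(rows)   # one row per person, each row carrying the total count
```

Every returned row carries the same `total_rows_in_cte`, so a single round trip serves both the page and the pagination header.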

Caveat 1: rank()

rank() can return multiple rows per person_id with rank = 1. row_number() (as used above) or DISTINCT ON (person_id) (like Gordon provided) are applicable replacements; that works for you, as the additional information clarified.

Neither submission_date nor last_updated is defined NOT NULL. That can be an issue with ORDER BY submission_date DESC, last_updated DESC ...

Should those columns really be NOT NULL?

You replied:

Yes, all those columns should be non-null. I can add that constraint. I made them nullable since the data we get in files is not always perfect, but that is a very rare condition and I can put in an empty string instead.

Empty strings are not allowed for type date. Keep the columns nullable. NULL is the proper value for those cases. Use NULLS LAST as demonstrated to avoid NULL being sorted on top.

If OFFSET is equal to or greater than the number of rows returned by the CTE, you get no rows, and therefore no total count either.

Addressing all caveats so far, and based on added information, we might arrive at this query:

WITH cte AS (
   SELECT DISTINCT ON (person_id) *
   FROM   tbl
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   ORDER  BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
   )
SELECT *
FROM  (
   TABLE  cte
   ORDER  BY person_id  -- ?? see below
   LIMIT  10
   OFFSET 0
   ) sub
RIGHT  JOIN (SELECT count(*) FROM cte) c(total_rows_in_cte) ON true;
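The OFFSET-past-the-end guarantee can be checked on a toy table. The sketch below uses SQLite via Python (names and data made up); since older SQLite lacks RIGHT JOIN, the same join is written with the count side on the left.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, person_id TEXT);
INSERT INTO tbl VALUES (1, 'p1'), (2, 'p2');
""")
# OFFSET far past the end: the page is empty, yet the count row survives.
rows = conn.execute("""
WITH cte AS (SELECT * FROM tbl)
SELECT sub.id, c.total_rows_in_cte
FROM  (SELECT count(*) AS total_rows_in_cte FROM cte) c
LEFT  JOIN (SELECT * FROM cte ORDER BY person_id LIMIT 10 OFFSET 1000) sub ON true
""").fetchall()
print(rows)   # single row: NULL page columns, but the total count is still there
```

Without the outer join, an out-of-range OFFSET would return zero rows and the UI would lose the total.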

Now the CTE is actually used twice. The RIGHT JOIN guarantees we get the total count no matter the OFFSET. DISTINCT ON should perform OK-ish for the few rows per person_id in the base query.

But you have wide rows. How wide on average? The query will likely result in a sequential scan over the whole table, and indexes won't help (much). All of this remains hugely inefficient for paging.

You cannot involve an index for paging, as the paging is based on the derived table from the CTE. And your actual sort criteria for paging are still unclear (ORDER BY id?). If paging is the goal, you desperately need a different query style; if you are only interested in the first few pages, yet another query style applies. The best solution depends on information still missing from the question ...
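One such "different query style" for paging is keyset (seek) pagination: instead of OFFSET, the next page filters past the last row already seen, which an index can serve directly. This is my assumption about the style meant here; a minimal sketch with a simplified two-column sort key (SQLite via Python, made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, submission_date TEXT);
INSERT INTO tbl VALUES
  (1, '2020-01-01'), (2, '2020-01-02'), (3, '2020-01-02'), (4, '2020-01-03');
""")
page1 = conn.execute("""
SELECT id, submission_date FROM tbl
ORDER BY submission_date DESC, id DESC
LIMIT 2
""").fetchall()
last_date, last_id = page1[-1][1], page1[-1][0]
# Next page: seek past the last row seen (row-wise comparison), no OFFSET.
page2 = conn.execute("""
SELECT id, submission_date FROM tbl
WHERE (submission_date, id) < (?, ?)
ORDER BY submission_date DESC, id DESC
LIMIT 2
""", (last_date, last_id)).fetchall()
print(page1, page2)
```

The cost of fetching page N stays constant instead of growing with the offset, because the WHERE clause lets the index skip all earlier rows.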

For your updated objective:

   find the latest entries for a person_id by submission_date

(Ignoring "for specified filter criteria, type, plan, status" for simplicity.)

And:

   only rows with status IN ('ACCEPTED', 'CORRECTED')

Based on these two specialized indexes:

CREATE INDEX ON tbl (submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST)
WHERE  status IN ('ACCEPTED', 'CORRECTED'); -- optional

CREATE INDEX ON tbl (person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST);

Run this query:

WITH RECURSIVE cte AS (
   (
   SELECT t  -- whole row
   FROM   tbl t
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   AND    NOT EXISTS (SELECT FROM tbl
                      WHERE  person_id = t.person_id 
                      AND   (  submission_date,   last_updated,   id)
                          > (t.submission_date, t.last_updated, t.id)  -- row-wise comparison
                      )
   ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT t1  -- whole row
           FROM   tbl t1
           WHERE ( t1.submission_date, t1.last_updated, t1.id)
               < ((t).submission_date,(t).last_updated,(t).id)  -- row-wise comparison
           AND    t1.status IN ('ACCEPTED', 'CORRECTED')
           AND    NOT EXISTS (SELECT FROM tbl
                              WHERE  person_id = t1.person_id 
                              AND   (   submission_date,    last_updated,    id)
                                  > (t1.submission_date, t1.last_updated, t1.id)  -- row-wise comparison
                              )
           ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (t).id IS NOT NULL
   )
SELECT (t).*
FROM   cte
LIMIT  10
OFFSET 0;

Every set of parentheses here is required.
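The NOT EXISTS building block used at each step of the recursive query can be checked in isolation. A toy sketch follows (SQLite via Python; SQLite supports the row-wise comparison but not Postgres's whole-row `t` syntax, so explicit columns are selected; table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, person_id TEXT,
                  submission_date TEXT, last_updated TEXT);
INSERT INTO tbl VALUES
  (1, 'p1', '2020-01-01', '2020-01-01'),
  (2, 'p1', '2020-02-01', '2020-02-01'),  -- latest for p1
  (3, 'p2', '2020-01-15', '2020-01-15');  -- latest for p2
""")
# A row qualifies if no other row for the same person sorts after it.
rows = conn.execute("""
SELECT t.id, t.person_id
FROM   tbl t
WHERE  NOT EXISTS (
   SELECT 1 FROM tbl
   WHERE  person_id = t.person_id
   AND   (submission_date, last_updated, id)
       > (t.submission_date, t.last_updated, t.id)  -- row-wise comparison
   )
ORDER  BY t.person_id
""").fetchall()
print(rows)   # exactly one (latest) row per person_id
```

The recursive query simply repeats this "latest row per person" test while walking down the sort order one qualifying row at a time.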

This level of sophistication should retrieve a relatively small set of top rows radically faster, by using the given indexes instead of a sequential scan.

submission_date should most probably be of type timestamptz or date, not character varying(255), which is an odd type definition in Postgres in any case.
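A quick illustration of why text dates are risky: unless the strings happen to be ISO-8601, lexical order is not chronological order (values are made up):

```python
# MM/DD/YYYY stored as text: sorts lexically, not chronologically.
dates_us = ['02/01/2021', '12/31/2020', '01/15/2021']
print(sorted(dates_us, reverse=True))   # the oldest date wrongly sorts first

# The same dates as ISO-8601 text: lexical order matches chronological order.
dates_iso = ['2021-02-01', '2020-12-31', '2021-01-15']
print(sorted(dates_iso, reverse=True))
```

With a real date or timestamptz column the problem disappears entirely, and the ORDER BY in the queries above can use an index correctly.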

Many more details might be optimized, but this is getting out of hand. You might consider professional consulting.
