PostgreSQL - fetch the row which has the Max value for a column

Question

I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id

  1. There are multiple users (distinct usr_id values)
  2. time_stamp is not a unique identifier: sometimes user events (one per row in the table) occur with the same time_stamp.
  3. trans_id is unique only over very small time ranges: it repeats over time
  4. remaining_lives (for a given user) can both increase and decrease over time

Example:

time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
  07:00  |       1       |   1  |   1    
  09:00  |       4       |   2  |   2    
  10:00  |       2       |   3  |   3    
  10:00  |       1       |   2  |   4    
  11:00  |       4       |   1  |   5    
  11:00  |       3       |   1  |   6    
  13:00  |       3       |   3  |   1    

As I will need to access other columns of the row with the latest data for each given usr_id, I need a query that gives a result like this:

time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
  11:00  |       3       |   1  |   6    
  10:00  |       1       |   2  |   4    
  13:00  |       3       |   3  |   1    

As mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they have the same timestamp! Therefore this query won't work:

SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM 
      (SELECT usr_id, max(time_stamp) AS max_timestamp 
       FROM lives GROUP BY usr_id ORDER BY usr_id) a 
JOIN lives b ON a.max_timestamp = b.time_stamp
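-- (this matches on time_stamp alone: it can pick up rows from other usr_ids
--  that happen to share the timestamp, and still returns several rows per
--  user when the max time_stamp is tied)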

Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows. This is the hacked up query that I've gotten to work:

SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM 
      (SELECT usr_id, max(time_stamp || '*' || trans_id) 
       AS max_timestamp_transid
       FROM lives GROUP BY usr_id ORDER BY usr_id) a 
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id 
ORDER BY b.usr_id
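-- note: the concatenation is compared as text, so trans_id values with a
--  different number of digits may sort incorrectly ('9' > '10' as text)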

Okay, so this works, but I don't like it. It requires a query within a query, a self join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMSs and Postgres in particular, so I know that I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.

I found a similar discussion here. Can I perform some type of Postgres equivalent to an Oracle analytic function?

Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and creating better queries would be much appreciated!

P.S. You can use the following to create my example case:

create TABLE lives (time_stamp timestamp, lives_remaining integer, 
                    usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);

Answer

On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30):

By query cost, below, I am referring to Postgres' cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighted estimate of the required I/O and CPU resources; you can obtain this by firing up PgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze":

  • Quassnoy's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
  • Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
  • Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
  • Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
    • this is Postgres-specific

All times above include retrieval of the full 10k rows result-set.

Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution can depend significantly on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that the cost estimate is exactly that, an estimate.
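
If you prefer the command line to PgAdminIII, EXPLAIN shows the same planner estimate and EXPLAIN ANALYZE additionally executes the query and reports actual timings. A minimal sketch, using a simple aggregate as a stand-in for whichever candidate query you are comparing:

      -- the top plan line's "cost=startup..total" is the planner's estimate;
      -- with ANALYZE the query is actually run and per-node actual times are shown
      EXPLAIN ANALYZE
      SELECT usr_id, MAX(time_stamp)
        FROM lives
       GROUP BY usr_id;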

The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC.) Query time will vary in production based on actual machine load/data access spread. When one query appears slightly faster (<20%) than the other but has a much higher cost, it will generally be wiser to choose the one with higher execution time but lower cost.

When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity) then the query time you obtained in standalone (e.g. pgAdminIII on a development PC) mode will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:

                    cost | time (dedicated machine) |     time (under load) |
      -------------------+--------------------------+-----------------------+
      some query A:   5k | (all data cached)  900ms | (less i/o)     1000ms |
      some query B:  50k | (all data cached)  900ms | (lots of i/o) 10000ms |
      

Do not forget to run ANALYZE lives once after creating the necessary indices.
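
For reference, the compound indexes mentioned in the timings above could be created along these lines (the index names are illustrative, not from the original answer):

      -- compound index assumed by query #1 (and by the Quassnoy timing above)
      CREATE INDEX lives_usr_trans_time_idx
          ON lives (usr_id, trans_id, time_stamp);

      -- compound function index assumed by query #2; the expression needs its
      --  own set of parentheses
      CREATE INDEX lives_usr_epoch_trans_idx
          ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);

      -- refresh the planner's statistics once the indices exist
      ANALYZE lives;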

Query #1

      -- incrementally narrow down the result set via inner joins
      --  the CBO may elect to perform one full index scan combined
      --  with cascading index lookups, or as hash aggregates terminated
      --  by one nested index lookup into lives - on my machine
      --  the latter query plan was selected given my memory settings and
      --  histogram
      SELECT
        l1.*
       FROM
        lives AS l1
       INNER JOIN (
          SELECT
            usr_id,
            MAX(time_stamp) AS time_stamp_max
           FROM
            lives
           GROUP BY
            usr_id
        ) AS l2
       ON
        l1.usr_id     = l2.usr_id AND
        l1.time_stamp = l2.time_stamp_max
       INNER JOIN (
          SELECT
            usr_id,
            time_stamp,
            MAX(trans_id) AS trans_max
           FROM
            lives
           GROUP BY
            usr_id, time_stamp
        ) AS l3
       ON
        l1.usr_id     = l3.usr_id AND
        l1.time_stamp = l3.time_stamp AND
        l1.trans_id   = l3.trans_max
      

Query #2

      -- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
      -- this results in a single table scan and one nested index lookup into lives,
      --  by far the least I/O intensive operation even in case of great scarcity
      --  of memory (least reliant on cache for the best performance)
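      --  note: EXTRACT(EPOCH FROM ...) turns the timestamp into a number so that
      --  both array elements share a type; ARRAY[time_stamp, trans_id] would not
      --  type-check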
      SELECT
        l1.*
       FROM
        lives AS l1
       INNER JOIN (
         SELECT
           usr_id,
           MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
             AS compound_time_stamp
          FROM
           lives
          GROUP BY
           usr_id
        ) AS l2
      ON
        l1.usr_id = l2.usr_id AND
        EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
        l1.trans_id = l2.compound_time_stamp[2]
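
A brief aside on why the MAX(ARRAY[...]) trick behaves like a max over the (time_stamp, trans_id) pair: Postgres compares arrays element by element from the left, like tuples, so the epoch value decides first and trans_id only breaks ties. A quick illustration:

      -- arrays compare element-wise from the left
      SELECT ARRAY[2, 3] > ARRAY[2, 1],   -- true: first elements tie, 3 > 1
             ARRAY[3, 0] > ARRAY[2, 99];  -- true: 3 > 2 decides immediately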
      

Updated 2013/01/29

Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:

Query #3

      -- use Window Functions
      -- performs a SINGLE scan of the table
      SELECT DISTINCT ON (usr_id)
        last_value(time_stamp) OVER wnd,
        last_value(lives_remaining) OVER wnd,
        usr_id,
        last_value(trans_id) OVER wnd
       FROM lives
       WINDOW wnd AS (
         PARTITION BY usr_id ORDER BY time_stamp, trans_id
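         -- the explicit frame is required: the default frame stops at the
         --  current row, which would make last_value() return the current
         --  row's values instead of the partition's last row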
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       );
      
