Proper way to access latest row for each individual identifier?


Problem description


I have a table core_message in Postgres, with millions of rows that looks like this (simplified):

┌────────────────┬──────────────────────────┬─────────────────┬───────────┬──────────────────────────────────────────┐
│    Colonne     │           Type           │ Collationnement │ NULL-able │                Par défaut                │
├────────────────┼──────────────────────────┼─────────────────┼───────────┼──────────────────────────────────────────┤
│ id             │ integer                  │                 │ not null  │ nextval('core_message_id_seq'::regclass) │
│ mmsi           │ integer                  │                 │ not null  │                                          │
│ time           │ timestamp with time zone │                 │ not null  │                                          │
│ point          │ geography(Point,4326)    │                 │           │                                          │
└────────────────┴──────────────────────────┴─────────────────┴───────────┴──────────────────────────────────────────┘
Index:
    "core_message_pkey" PRIMARY KEY, btree (id)
    "core_message_uniq_mmsi_time" UNIQUE CONSTRAINT, btree (mmsi, "time")
    "core_messag_mmsi_b36d69_idx" btree (mmsi, "time" DESC)
    "core_message_point_id" gist (point)

The mmsi column is a unique identifier used to identify ships in the world. I'm trying to get the latest row for each mmsi.

I can get that like this, for example:

SELECT a.* FROM core_message a
JOIN  (SELECT mmsi, max(time) AS time FROM core_message GROUP BY mmsi) b
       ON a.mmsi=b.mmsi and a.time=b.time;

But this is too slow, 2 seconds+.

So my solution was to create a distinct table containing only the latest rows (100K+ rows max) of the core_message table, called LatestMessage.

This table is populated via my application every time new rows have to be added to core_message.

It worked fine, I'm able to access the table in a matter of milliseconds. But I'd be curious to know if there is a better way to achieve that using only one table and keep the same level of performance for data access.
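(For reference, the same LatestMessage idea could also be maintained inside the database instead of in application code, via an upsert trigger. This is only a hypothetical sketch: the table, function, and trigger names are made up, and the column list is the simplified one from above.)

```sql
-- Hypothetical sketch: keep a one-row-per-ship table in sync with a trigger
-- instead of application code. Names here are assumptions, not my real schema.
CREATE TABLE latest_message (
    mmsi    integer PRIMARY KEY,
    "time"  timestamptz NOT NULL,
    point   geography(Point, 4326)
);

CREATE FUNCTION upsert_latest_message() RETURNS trigger AS $$
BEGIN
    INSERT INTO latest_message (mmsi, "time", point)
    VALUES (NEW.mmsi, NEW."time", NEW.point)
    ON CONFLICT (mmsi) DO UPDATE
       SET "time" = EXCLUDED."time",
           point  = EXCLUDED.point
       WHERE latest_message."time" < EXCLUDED."time";  -- ignore out-of-order rows
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_latest_message
AFTER INSERT ON core_message
FOR EACH ROW EXECUTE FUNCTION upsert_latest_message();
```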

Solution

This answer seems to point in the same direction as the DISTINCT ON answer here, but it also mentions this:

For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
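(The "faster query technique" the quote alludes to for the case where you don't have a separate table is typically a recursive CTE that emulates the loose index scan. A sketch adapted to my table, under the assumption that the (mmsi, "time" DESC) index is in place, might look like this; I have not benchmarked it on my data set:)

```sql
-- Sketch of a loose-index-scan ("skip scan") emulation with a recursive CTE.
-- Each iteration jumps to the latest row of the next-higher mmsi, so only
-- one index probe per distinct mmsi is needed instead of a full scan.
WITH RECURSIVE cte AS (
   (SELECT mmsi, "time"
    FROM   core_message
    ORDER  BY mmsi, "time" DESC
    LIMIT  1)                      -- latest row of the smallest mmsi
   UNION ALL
   SELECT m.mmsi, m."time"
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT mmsi, "time"
      FROM   core_message
      WHERE  mmsi > c.mmsi        -- jump past the current ship
      ORDER  BY mmsi, "time" DESC
      LIMIT  1
      ) m
   )
SELECT * FROM cte;
```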

Using this other great answer, I found a way to keep the same performance as with a distinct table by using LATERAL. With a new table test_boats I can do something like this:

 CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);

This table creation takes 40+ seconds, which is pretty similar to the time taken by the other answer here.
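(One caveat: test_boats has to be kept current as new ships appear in core_message. A minimal way to do that, assuming mmsi is made the table's primary key, could be an upsert on each incoming message; the mmsi value below is just a placeholder:)

```sql
-- Hypothetical maintenance: make mmsi the key, then upsert when inserting
-- into core_message so new ships show up in test_boats automatically.
ALTER TABLE test_boats ADD PRIMARY KEY (mmsi);

INSERT INTO test_boats (mmsi)
VALUES (227006760)              -- mmsi of the incoming core_message row
ON CONFLICT (mmsi) DO NOTHING;  -- already-known ships are a no-op
```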

Then, with the help of LATERAL:

SELECT a.mmsi, b.time
FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.time
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b LIMIT 10;

This is blazingly fast, 1+ millisecond.

This will require modifying my program's logic and using a slightly more complex query, but I think I can live with that.

For a fast solution without the need to create a new table, check out @ErwinBrandstetter's answer below.


UPDATE: I feel this question is not quite answered yet, as it's not very clear why the other proposed solutions perform poorly here.

I tried the benchmark mentioned here. At first, it would seem that the DISTINCT ON way is fast enough if you run a query like the one proposed in the benchmark: +/- 30 ms on my computer. But this is because that query uses an index-only scan. If you include a field that is not in the index (some_column in the case of the benchmark), the performance drops to +/- 100 ms.
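(To see which of the two plans you actually get, EXPLAIN makes the difference visible. For instance, on my table, with the column lists adapted to my schema:)

```sql
-- Both columns are covered by core_messag_mmsi_b36d69_idx, so the plan
-- should show "Index Only Scan".
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (mmsi) mmsi, "time"
FROM core_message
ORDER BY mmsi, "time" DESC;

-- Adding a column not in the index (e.g. point) forces heap fetches,
-- and the plan degrades to a plain "Index Scan".
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (mmsi) mmsi, "time", point
FROM core_message
ORDER BY mmsi, "time" DESC;
```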

That is not a dramatic drop in performance yet, which is why we need a benchmark with a bigger data set, something similar to my case: 40K customers and 8M rows (see here).

Let's try DISTINCT ON again with this new table:

SELECT DISTINCT ON (customer_id) id, customer_id, total 
FROM purchases_more 
ORDER BY customer_id, total DESC, id;

This takes about 1.5 seconds to complete.

SELECT DISTINCT ON (customer_id) *
FROM purchases_more 
ORDER BY customer_id, total DESC, id;

This takes about 35 seconds to complete.
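(The 35-second run is largely the cost of fetching full rows from the heap and sorting them. If only a few extra columns are needed beyond the sort keys, a covering index could in principle keep the scan index-only. This is a hypothetical sketch: it assumes Postgres 11+ for the INCLUDE syntax and a some_column-style extra column in purchases_more, as in the original benchmark.)

```sql
-- Hypothetical covering index: the sort keys are the key columns, and the
-- extra payload column rides along via INCLUDE (Postgres 11+), so the
-- DISTINCT ON query can stay index-only.
CREATE INDEX purchases_more_covering_idx
ON purchases_more (customer_id, total DESC, id)
INCLUDE (some_column);
```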

Now, to come back to my first solution above. It uses an index-only scan and a LIMIT; that's one of the reasons why it is extremely fast. If I recraft that query to avoid the index-only scan and drop the outer LIMIT:

SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.*
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;

This will take about 500ms, which is still pretty fast.

For a more in-depth benchmark of the sort step, see my other answer below.
