Best performance in sampling repeated value from a grouped column


Question

This question is about getting the functionality of first_value() with another function or a workaround.

It is also about a small gain in performance on big tables. Using e.g. max() in the context explained below demands spurious comparisons. Even if fast, that imposes some additional cost.

This typical query

SELECT x, y, count(*) as n 
FROM t 
GROUP BY x, y;

needs to repeat all columns in GROUP BY to return more than one column. A bit of syntactic sugar for this is to use positional references:

SELECT x, y, count(*) as n 
FROM t 
GROUP BY x, 2  -- imagine that 2, 3, etc. are repeated with x

Sometimes more than sugar is needed: some semantics, to make sense of a complex context:

SELECT x, COALESCE(y,z), count(*) as n 
FROM t 
GROUP BY x, y, z;  -- are y and z really needed as grouping columns?

I can imagine many other complex contexts. Let's look at the usual solution:

SELECT x, max(y) as y, count(*) as n 
FROM t 
GROUP BY x;  -- best semantics: no need for other columns here

where max() could be any "sample()" function (e.g. first or last value). Something that does no comparison at all should perform better than max(), e.g. first_value(), but that is a window function and needs a WINDOW, so the gain is lost. There are some old suggestions to implement first/last aggregate functions in C.
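As a rough sketch of the first_value() variant mentioned above (the inline sample rows are made up for illustration): because first_value() is a window function, every input row is kept, and DISTINCT has to collapse the duplicates afterwards. That extra step is where the expected gain over max() tends to disappear.

```sql
-- Hypothetical sample data, inlined so the query runs on its own.
WITH t(x, y) AS (
   VALUES (1, 10), (1, 20), (2, 30)
)
SELECT DISTINCT
       x,
       first_value(y) OVER w AS y,  -- arbitrary pick: w has no ORDER BY
       count(*)       OVER w AS n   -- 2 for x = 1, 1 for x = 2
FROM   t
WINDOW w AS (PARTITION BY x);
```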

Is there some "get any one value fast" aggregate function with better performance than max() or GROUP BY x, 2, ...?
Perhaps some new feature in a recent release?

Recommended answer

If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

SELECT DISTINCT ON (x) x, y, z FROM t;

x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.
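A small illustration with made-up rows, assuming PostgreSQL (DISTINCT ON is a PostgreSQL extension to standard SQL). The second query is included only for contrast: adding ORDER BY makes the pick deterministic, typically at the cost of a sort or a matching index.

```sql
-- Hypothetical sample data, inlined so the queries run on their own.
WITH t(x, y, z) AS (
   VALUES (1, 10, 100), (1, 20, 200), (2, 30, 300)
)
-- One row per x; which of the two x = 1 rows comes back is not defined.
SELECT DISTINCT ON (x) x, y, z FROM t;

WITH t(x, y, z) AS (
   VALUES (1, 10, 100), (1, 20, 200), (2, 30, 300)
)
-- For contrast: the row with the smallest y per x is picked.
SELECT DISTINCT ON (x) x, y, z FROM t ORDER BY x, y;
```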

If you need a count anyway, your options with regard to performance are limited, since the whole table has to be read in either case. Still, you can combine DISTINCT ON with window functions in the same SELECT:

SELECT DISTINCT ON (x) x, y, z, count(*) OVER (PARTITION BY x) AS x_count FROM t;

Consider the sequence of events in a SELECT query:
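The point that matters here is that window functions are evaluated before DISTINCT ON removes rows, so the count still covers every row of the partition. A minimal sketch with made-up rows:

```sql
-- Hypothetical sample data: three rows with x = 1, one row with x = 2.
WITH t(x, y, z) AS (
   VALUES (1, 10, 100), (1, 20, 200), (1, 30, 300), (2, 40, 400)
)
SELECT DISTINCT ON (x)
       x, y, z,                                   -- y, z: arbitrary row per x
       count(*) OVER (PARTITION BY x) AS x_count  -- 3 for x = 1, 1 for x = 2
FROM   t;
```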

Depending on requirements, there may be faster ways to get counts:

In combination with GROUP BY, the only realistic option I see to gain some performance is the first_last_agg extension. But don't expect much.
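To show the shape of such a "first value" aggregate without installing anything, here is a plain-SQL stand-in. It is not the extension's code or API: the names first_any_sfunc and first_any are made up for this sketch, and a SQL-language transition function is generally slower than a C implementation, so treat it as an illustration of the idea rather than a performance fix.

```sql
-- Transition function: keep the current state, ignore the new value.
-- STRICT plus a NULL initial state means the first non-null input
-- becomes the state automatically.
CREATE FUNCTION first_any_sfunc(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT $1';

-- Hypothetical aggregate name, chosen for this sketch only.
CREATE AGGREGATE first_any(anyelement) (
  SFUNC = first_any_sfunc,
  STYPE = anyelement
);

-- Usage with made-up rows: no comparisons between y values are needed.
WITH t(x, y) AS (
   VALUES (1, 10), (1, 20), (2, 30)
)
SELECT x, first_any(y) AS y, count(*) AS n
FROM   t
GROUP  BY x;
```

Without an ORDER BY inside the aggregate call, which y is seen first in each group is up to the executor, which matches the "any one value" requirement of the question.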

For other use cases without count (including the simple case at the top), there are faster solutions, depending on your exact use case, in particular to get the "first" or "last" value of each set. Emulate a loose index scan (like @Mihai commented):
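One common way to emulate a loose index scan in PostgreSQL is a recursive CTE that jumps from one distinct x to the next through an index, instead of reading every row. The sketch below assumes PostgreSQL 9.3 or later (for LATERAL), a hypothetical demo table t_demo standing in for t, and an index on (x, y). It tends to pay off only when there are few distinct x values and many rows per value.

```sql
-- Hypothetical demo table and data; x is NOT NULL to keep the sketch simple.
CREATE TEMP TABLE t_demo (x int NOT NULL, y int);
INSERT INTO t_demo VALUES (1, 10), (1, 20), (2, 30), (3, 5), (3, 7);
CREATE INDEX ON t_demo (x, y);

WITH RECURSIVE cte AS (
   -- Anchor: the first row in index order (smallest x, then smallest y).
   (SELECT x, y FROM t_demo ORDER BY x, y LIMIT 1)
   UNION ALL
   -- Step: jump to the first row of the next larger x.
   SELECT l.x, l.y
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t.x, t.y
      FROM   t_demo t
      WHERE  t.x > c.x
      ORDER  BY t.x, t.y
      LIMIT  1
   ) l
)
SELECT x, y FROM cte ORDER BY x;  -- (1, 10), (2, 30), (3, 5)
```

The recursion stops by itself once no larger x exists, because the LATERAL subquery then returns no row.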
