Postgresql vs. aggregates


Problem description


I have an application with a table that tracks objects with a "size"
attribute. What we want to do is to periodically report on the number
of these objects and the sum of the object sizes. The table will
typically have 1-5 million rows.

I know that the obvious "select count(*), sum(size) ..." is not a good
idea, and we have seen the impact such a query has on performance.

Then we tried creating triggers to maintain a table with object_count
and sum_size columns:

create table summary(object_count bigint, sum_size bigint)

We populate the table with exactly one row, initialized to (0, 0).
A trigger runs an update like this:

update summary
set object_count = object_count + 1,
sum_size = sum_size + new.size
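
For concreteness, the full trigger machinery might look like this in PL/pgSQL; the base table `objects` and all object names here are invented for illustration, since the post does not show them:

```sql
-- Hypothetical base table; the post only tells us rows have a "size".
create table objects (
    id   bigserial primary key,
    size bigint not null
);

create table summary (object_count bigint not null, sum_size bigint not null);
insert into summary values (0, 0);

create or replace function maintain_summary() returns trigger as $$
begin
    update summary
    set object_count = object_count + 1,
        sum_size     = sum_size + new.size;
    return new;
end;
$$ language plpgsql;

create trigger objects_summary
after insert on objects
for each row execute procedure maintain_summary();
```

A delete trigger decrementing the counters would be the symmetric companion. Note that every such update creates a new dead version of the single summary row, which is exactly the vacuum problem this thread is about.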

The problem is that our application has periods where objects are
being created at a rapid rate. Updates to the summary table slow down
over time, and the slowdown is accompanied by very low CPU idle% as
reported by vmstat.

I found that if I have a thread which vacuums the summary table every
few seconds (e.g. 30), then update times stay reasonable. We're
starting to reconsider this approach, wondering if the vacuuming or
the buildup of row versions (in those 30 seconds) is behind some
variability in performance that we're seeing. To that end, we are
measuring the impact of more frequent and less frequent
vacuuming. I'll skip the details here since this is not the main point
of this email.

Another idea we had is to maintain object_count and sum_size in
sequences. The trigger would use currval and setval to maintain the
values. The problem with this is that when we try to examine the
sequence on another connection, we can't use currval before bumping
the value with nextval, which would interfere with maintenance of the
totals. (The error message for a sequence s is "ERROR: s.currval
is not yet defined in this session".)
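
One footnote on the sequence idea: the running value can in fact be read from another session without currval, because a sequence is itself a one-row relation. A sketch with an invented sequence name:

```sql
create sequence sum_size_seq;

-- In the maintaining session, the trigger can bump the sequence:
select nextval('sum_size_seq');

-- In a fresh session, currval fails exactly as described:
--   select currval('sum_size_seq');
--   ERROR: currval is not yet defined in this session

-- But the stored value is still readable directly from the relation:
select last_value from sum_size_seq;
```

The caveat is that sequences are non-transactional, so such a read can include increments made by transactions that later roll back.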

Yet another idea is to write C functions which maintain counts in
shared memory, simulating what we can't quite do with sequences.

I understand why asking for aggregate values computed over rapidly
changing data sets is not a completely meaningful thing to do. Yet we
have the requirement. Users of our product need to be able to look at
object_count and sum_size to get a sense of how an object load is
proceeding.

This shouldn't be so difficult. I've seen, on this mailing list,
the advice "don't do that, use a trigger" when someone wants to do
select count(*). But I haven't seen any discussion of the consequences
of using a trigger.

Can anyone recommend a good way to maintain aggregates using a
trigger? Is vacuuming every few seconds really the only way to go?

Jack Orenstein
----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Answers

On Wed, 2004-06-09 at 16:32, ja*@geophile.com wrote:
> I have an application with a table that tracks objects with a "size"
> attribute. What we want to do is to periodically report on the number
> of these objects and the sum of the object sizes. The table will
> typically have 1-5 million rows.
>
> I know that the obvious "select count(*), sum(size) ..." is not a good
> idea, and we have seen the impact such a query has on performance.
>
> Then we tried creating triggers to maintain a table with object_count
> and sum_size columns:
>
> create table summary(object_count bigint, sum_size bigint)
>
> We populate the table with exactly one row, initialized to (0, 0).
> A trigger runs an update like this:
>
> update summary
> set object_count = object_count + 1,
> sum_size = sum_size + new.size
>
> The problem is that our application has periods where objects are
> being created at a rapid rate. Updates to the summary table slow down
> over time, and the slowdown is accompanied by very low CPU idle% as
> reported by vmstat.

I'm wondering if these data are time sensitive, and if so, how
sensitive. If they could be up to say a minute old or something, using
a materialized view of some sort might work well here. Or a temp table
to hold all the incoming data until it needs to be dumped into the real
table.

Just a thought, might not work for you.
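
The temp-table idea can be sketched roughly like so; table names are invented, and the staging table is locked during the fold so rows inserted mid-fold are not deleted unaccounted (an alternative is to swap between two staging tables):

```sql
-- Incoming rows land in a small staging table (plain insert, no trigger):
create table objects_staging (size bigint not null);

-- A job run every minute folds the staged rows into the summary
-- and the real table in a single transaction:
begin;
lock table objects_staging in exclusive mode;
update summary
set object_count = object_count + (select count(*) from objects_staging),
    sum_size     = sum_size
                   + coalesce((select sum(size) from objects_staging), 0);
insert into objects (size) select size from objects_staging;
delete from objects_staging;
commit;

-- Reports that must not lag by a minute combine both tables:
select object_count + (select count(*) from objects_staging) as object_count,
       sum_size + coalesce((select sum(size) from objects_staging), 0) as sum_size
from summary;
```

The summary row then gets one dead version per minute instead of one per insert, at the cost of a slightly more complex reporting query.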

Also, do you have a lot of parallel updates to the same rows, or are
they all different rows? If you're doing a lot of updating of the same
rows over and over, any database is going to be somewhat slow, either in
vacuuming like for postgresql, or locking as in row level locking
databases.

If they're all different, then this should be no different for
postgresql than for a row locking database, since each tuple will only
have two or three versions, instead of some smaller percentage of rows
having hundreds of versions.

Just wondering.
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings



Quoting Scott Marlowe <sm******@qwest.net>:
> On Wed, 2004-06-09 at 16:32, ja*@geophile.com wrote:
> > Then we tried creating triggers to maintain a table with object_count
> > and sum_size columns:
> >
> > create table summary(object_count bigint, sum_size bigint)
> >
> > We populate the table with exactly one row, initialized to (0, 0).
> > A trigger runs an update like this:
> >
> > update summary
> > set object_count = object_count + 1,
> > sum_size = sum_size + new.size
> >
> > The problem is that our application has periods where objects are
> > being created at a rapid rate. Updates to the summary table slow down
> > over time, and the slowdown is accompanied by very low CPU idle% as
> > reported by vmstat.
> I'm wondering if these data are time sensitive, and if so, how
> sensitive. If they could be up to say a minute old or something, using
> a materialized view of some sort might work well here. Or a temp table
> to hold all the incoming data until it needs to be dumped into the real
> table.


So once a minute I'd run a stored procedure to compute statistics
and copy rows to the main table? That might be doable, but then
queries would have to be to a view combining the main table and
the temp table.

Can you expand on the use of materialized views to solve this?
> Also, do you have a lot of parallel updates to the same rows, or are
> they all different rows? If you're doing a lot of updating of the same
> rows over and over, any database is going to be somewhat slow, either in
> vacuuming like for postgresql, or locking as in row level locking
> databases.
>
> If they're all different, then this should be no different for
> postgresql than for a row locking database, since each tuple will only
> have two or three versions, instead of some smaller percentage of rows
> having hundreds of versions.





The summary table has exactly one row, and whenever an object
is added to the main table, this one summary row is updated. An
odd feature of my application is that there is only one object
creation going on at a time. The summary row would not be a concurrency
hotspot in a non-MVCC database system.

But that raises an interesting idea. Suppose that instead of one
summary row, I had, let's say, 1000. When my application creates
an object, I choose one summary row at random (or round-robin) and update
it. So now, instead of one row with many versions, I have 1000 with 1000x
fewer versions each. When I want object counts and sizes, I'd sum up across
the 1000 summary rows. Would that allow me to maintain performance
for summary updates with less frequent vacuuming?
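
A sketch of that 1000-slot idea, with invented names (generate_series may not exist on older releases, where a loop would populate the slots instead):

```sql
create table summary_slots (
    slot         int primary key,
    object_count bigint not null,
    sum_size     bigint not null
);
insert into summary_slots
select s, 0, 0 from generate_series(0, 999) as s;

create or replace function maintain_summary_slots() returns trigger as $$
begin
    -- Pick one of the 1000 rows at random so dead versions are spread out.
    update summary_slots
    set object_count = object_count + 1,
        sum_size     = sum_size + new.size
    where slot = floor(random() * 1000)::int;
    return new;
end;
$$ language plpgsql;

-- Readers aggregate over the (small) slot table:
select sum(object_count) as object_count,
       sum(sum_size)     as sum_size
from summary_slots;
```

The hope would be that each slot accumulates dead versions roughly 1000x more slowly, so a given vacuum interval leaves each row's version chain correspondingly shorter, while the reader's scan over 1000 rows stays trivially cheap.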

I'd find all this much easier to reason about if I understood how
the versions of a row are organized and accessed. How does postgresql
locate the correct version of a row?
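
(An aside on that last question: each update leaves the old row version in place on disk; every version carries hidden system columns, and a scan compares them with its snapshot to decide which version is visible, with vacuum reclaiming versions no session can still see. The bookkeeping can be inspected directly:

```sql
-- xmin is the id of the transaction that created this row version;
-- xmax is the id of the one that updated/deleted it (0 while the row is live).
select xmin, xmax, object_count, sum_size from summary;
```

Watching xmin change across updates to the summary row makes the version churn described in this thread visible.)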

Jack Orenstein

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

