Calendar tables in PostgreSQL 9


Problem Description

I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).

I have come across some references to building similar warehouses which implement the concept of 'calendar tables'. This makes sense and is easily enough done. Most examples I see, however, are calendar tables that limit scope to 'day'. My data will need to be analyzed down to hour-level. Possibly minutes.

My question: would an implementation of calendar tables for hour/minute-level granularity be of value in terms of space-efficiency and query/sorting speed? If so, can you recommend a table structure and population method/example?

My primary data table will contain 20+ million rows of data at any given time and typical subsets for analysis are in the 1 to 5 million range. So, as you can see, that is a lot of timestamp fields.

Recommended Answer

Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.

If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.

2011-01-01 00:00:00
2011-01-01 00:01:00
2011-01-01 00:02:00
2011-01-01 00:03:00
2011-01-01 00:04:00

If you're doing "bucket" analysis, you might be better off with something like this.

bucket_start         bucket_end
--
2011-01-01 00:00:00  2011-01-01 00:01:00
2011-01-01 00:01:00  2011-01-01 00:02:00
2011-01-01 00:02:00  2011-01-01 00:03:00
2011-01-01 00:03:00  2011-01-01 00:04:00
2011-01-01 00:04:00  2011-01-01 00:05:00

Since SQL's BETWEEN operator includes both endpoints, you usually need to avoid it with buckets like these: it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest bit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value for bucket_start.)
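The usual alternative is a half-open comparison: `>=` against bucket_start and `<` against bucket_end, so each timestamp lands in exactly one bucket. A minimal sketch, in which the `events` table and its `event_ts` column are hypothetical stand-ins for your primary data table:

```sql
-- "events(event_ts timestamp)" is a hypothetical stand-in for the
-- primary data table; only the half-open comparison matters here.
select c.bucket_start,
       count(e.event_ts) as events_in_bucket
from calendar c
left join events e
       on e.event_ts >= c.bucket_start   -- start endpoint included
      and e.event_ts <  c.bucket_end     -- end endpoint excluded
group by c.bucket_start
order by c.bucket_start;
```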

If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)

create table calendar (
  bucket_start timestamp primary key,
  bucket_end timestamp unique,
  CHECK (bucket_end = bucket_start + interval '1' minute)
  -- You also want a "no gaps" constraint, but I don't think you 
  -- can do that in a CHECK constraint in PostgreSQL. You might
  -- be able to use a trigger that counts the rows, and compares
  -- that count to the number of minutes between min(bucket_start)
  -- and max(bucket_start). Worst case, you can always run a report
  -- that counts the rows and sends you an email.
);

The UNIQUE constraint creates an implicit index in PostgreSQL.

This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.

insert into calendar
select coalesce(
                (select max(bucket_start) from calendar), 
                 cast('2011-01-01 00:00:00' as timestamp)
               ) 
             + cast((n || ' minute') as interval) as bucket_start, 
       coalesce(
                (select max(bucket_start) from calendar), 
                 cast('2011-01-01 00:00:00' as timestamp)
               ) 
             + cast((n + 1 || ' minute') as interval) as bucket_end
from generate_series(1, (24*60) ) n;

You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.
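A minimal sketch of such a wrapper, assuming the calendar table above (the function name fill_calendar_year is made up). Note that a PL/pgSQL function runs in a single transaction, so committing in smaller batches would mean running the daily INSERT from a client-side loop instead:

```sql
-- Sketch only: loops the daily INSERT 365 times inside one transaction.
create or replace function fill_calendar_year()
returns void language plpgsql as $$
declare
  d integer;
begin
  for d in 1 .. 365 loop
    insert into calendar
    select coalesce((select max(bucket_start) from calendar),
                    cast('2011-01-01 00:00:00' as timestamp))
             + cast((n || ' minute') as interval),
           coalesce((select max(bucket_start) from calendar),
                    cast('2011-01-01 00:00:00' as timestamp))
             + cast((n + 1 || ' minute') as interval)
    from generate_series(1, 24 * 60) n;
  end loop;
end;
$$;
```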

It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.
