How to aggregate data from multiple years on MM-DD, ignoring year

Problem description

Postgres version 9.4.18, PostGIS Version 2.2.

Here are the tables I'm working with (I'm unlikely to be able to make significant changes to the table structure):

Table ltg_data (spans 1988 to 2018):

 Column   |           Type           | Modifiers 
----------+--------------------------+-----------
intensity | integer                  | not null
time      | timestamp with time zone | not null
lon       | numeric(9,6)             | not null
lat       | numeric(8,6)             | not null
ltg_geom  | geometry(Point,4269)     | 
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")

Size of ltg_data (~800M rows):

ltg=# select pg_relation_size('ltg_data');
pg_relation_size 
------------------
 149729288192

Table counties:

  Column   |            Type             |                       Modifiers                      
-----------+-----------------------------+--------------------------------------------------------
gid        | integer                     | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer                     | 
objectid   | integer                     | 
state      | character varying(2)        | 
cwa        | character varying(9)        | 
countyname | character varying(24)       | 
fips       | character varying(5)        | 
time_zone  | character varying(2)        | 
fe_area    | character varying(2)        | 
lon        | double precision            | 
lat        | double precision            | 
the_geom   | geometry(MultiPolygon,4269) | 
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)

Desired result: I want a time series with one row for every day of the year in format 'MM-DD', ignoring the year (01-01, 01-02, 01-03, ..., 12-31), along with the count of rows in table ltg_data for each day of the year. Eventually I also want the same thing for every hour of every day of the year ('MM-DD-HH').

A group by statement should accomplish this, but I'm having a hard time joining the "big" table with the days generated with generate_series().

MM-DD  | total_count   
-------+------------
12-22  |       9
12-23  |       0
12-24  |       0
12-25  |       0
12-26  |      23
12-27  |       0
12-28  |       5
12-29  |       0
12-30  |       0
12-31  |       0

Some of my many attempted queries:

SELECT date_trunc('day', d),
   count(a.lat) AS strikes
FROM generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d
LEFT JOIN
(SELECT date_trunc('day', TIME) AS day_of_year,
      ltg_data.lat
 FROM ltg_data
 JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
 WHERE cwa = 'MFR' ) AS a ON d = day_of_year
GROUP BY d
ORDER BY d ASC;

But this doesn't ignore year. I shouldn't be surprised because the "day" in date_trunc is still considering the year I guess.

2017-12-27 00:00:00-08 |       0
2017-12-28 00:00:00-08 |       0
2017-12-29 00:00:00-08 |       0
2017-12-30 00:00:00-08 |       0
2017-12-31 00:00:00-08 |       0
2018-01-01 00:00:00-08 |       0
2018-01-02 00:00:00-08 |       12
2018-01-03 00:00:00-08 |       0

And this query, in which I'm trying to convert the dates from generate_series() to text in 'MM-DD' format so they can be joined to the ltg_data values, also converted to text. It says the data types don't match. I've tried extract as well, since that could provide "doy" and "hour", which would work, but I can't seem to match data types in that query either. It's hard to make that "generate_series" a double precision.

SELECT to_char(d, 'MM-DD') AS DAY,
   count(a.lat) AS strikes
FROM
(SELECT generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d) 
AS f
LEFT JOIN
(SELECT to_char(TIME, 'MM-DD') AS day_of_year,
      ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON f = day_of_year
GROUP BY d
ORDER BY d ASC;

Result:

ERROR:  operator does not exist: record = text
LINE 4: ON f = day_of_year group by d order by d asc;
         ^
HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.

Conclusion: I'm aiming at getting daily and hourly total counts that span many years but group by 'MM-DD' and 'MM-DD-HH' (ignoring year), with the query results showing all days/hours even if they are zero.

Later I'll also try to find averages and percentiles for days and hours, so if you have any advice on that, I'm all ears. But my current problem is focused on just getting a complete result for totals.

Solution

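Basically, to cut off the year, to_char(time, 'MMDD') does the job, like you already tried. You just forgot to also apply it to the timestamps generated with generate_series() before joining. (In your second attempt, ON f = day_of_year compares the whole row f, a record, with a text value, hence the error.) There are also some other minor details.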
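As a hedged sketch of that fix, here is the second attempt from the question with to_char() applied on both sides and the join made on the resulting text column rather than on the row alias. The aliases f, a, l, c and the column name mmdd are illustrative choices, and 2018 is only a dummy year. Note that to_char() on a timestamptz depends on the session time zone.

```sql
SELECT f.mmdd, count(a.lat) AS strikes       -- count(a.lat) is 0 for days without matches
FROM  (
   SELECT to_char(d, 'MM-DD') AS mmdd         -- year removed on the generated side
   FROM   generate_series(timestamp '2018-01-01'
                        , timestamp '2018-12-31'
                        , interval '1 day') AS d
   ) f
LEFT  JOIN (
   SELECT to_char(l.time, 'MM-DD') AS mmdd    -- year removed on the data side
        , l.lat
   FROM   ltg_data l
   JOIN   counties c ON ST_Contains(c.the_geom, l.ltg_geom)
   WHERE  c.cwa = 'MFR'
   ) a USING (mmdd)                           -- text = text, so the types match
GROUP  BY f.mmdd
ORDER  BY f.mmdd;
```

This version aggregates after the join, like the original attempt. Aggregating before the join is usually cheaper on a table of this size.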

To simplify, and for performance and convenience, I suggest this simple function to calculate an integer following the pattern 'MMDD' from a given date.

CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';

I used to_char(time, 'MMDD') at first, but switched to the above expression that turned out to be fastest in various tests.
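A quick way to check what the function returns, assuming f_mmdd() has been created as shown above. The sample dates are arbitrary:

```sql
-- Each result is month * 100 + day; the year plays no role
SELECT f_mmdd(date '2018-01-01') AS jan_01             -- 101
     , f_mmdd(date '2018-12-22') AS dec_22             -- 1222
     , f_mmdd(date '1988-12-22') AS dec_22_other_year; -- 1222
```

Because the integers sort in calendar order (101 < 229 < 1222), ORDER BY on the result gives the days of the year in sequence.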

It can be used in expression indexes since it's defined IMMUTABLE. And it still allows function inlining because it only uses EXTRACT (xyz FROM date), which is implemented with the IMMUTABLE function date_part(text, date) internally. (Note that date_part(text, timestamptz) is only STABLE.)

Then this kind of query does the job:

SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM  (
   SELECT f_mmdd(d::date) AS mmdd  -- ignoring the year
   FROM   generate_series(timestamp '2018-01-01'  -- any dummy year
                        , timestamp '2018-12-31'
                        , interval '1 day') d
   ) d
LEFT  JOIN (
   SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
   FROM   counties c
   JOIN   ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
   WHERE  cwa = 'MFR'
   GROUP  BY 1
   ) ct USING (mmdd)
ORDER  BY 1;
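The question also asks for the same result per hour ('MM-DD-HH'). The following is a minimal sketch of one way to extend the pattern, not part of the original answer. The helper f_mmddhh(), the aliases g, l, c, and the time zone name are assumptions: 'America/Los_Angeles' is only guessed from the -08 offset in the question's output. The dummy year is a leap year so that 02-29 gets a row too; with a non-leap dummy year such as 2018, rows from February 29 have no match in the generated series and drop out of the result.

```sql
-- Hypothetical helper, analogous to f_mmdd(): month, day and hour packed into one integer
CREATE FUNCTION f_mmddhh(timestamp) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 10000
       + EXTRACT(day   FROM $1) * 100
       + EXTRACT(hour  FROM $1))::int';

SELECT g.mmddhh, COALESCE(ct.ct, 0) AS total_count
FROM  (
   SELECT f_mmddhh(ts) AS mmddhh
   FROM   generate_series(timestamp '2016-01-01 00:00'  -- dummy leap year, includes 02-29
                        , timestamp '2016-12-31 23:00'
                        , interval '1 hour') ts
   ) g
LEFT  JOIN (
   -- AT TIME ZONE turns timestamptz into a local timestamp; the zone name is an assumption
   SELECT f_mmddhh(l.time AT TIME ZONE 'America/Los_Angeles') AS mmddhh
        , count(*) AS ct
   FROM   counties c
   JOIN   ltg_data l ON ST_Contains(c.the_geom, l.ltg_geom)
   WHERE  c.cwa = 'MFR'
   GROUP  BY 1
   ) ct USING (mmddhh)
ORDER  BY 1;
```

The generated side yields 8784 rows (366 days times 24 hours), one per hour of the dummy year.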

Since time (I would use a different column name) is data type timestamptz, the cast time::date depends on the time zone setting of your current session. ("Days" are defined by the time zone you are in.) To get immutable (but slower) results, use the AT TIME ZONE construct with a time zone name, like:

SELECT f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date) ...
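To illustrate why the time zone matters, here is a small self-contained example with an arbitrary sample instant. The same timestamptz value falls on different calendar days depending on the zone, so it would be counted under a different MM-DD:

```sql
SELECT (ts AT TIME ZONE 'UTC')::date                 AS day_utc
     , (ts AT TIME ZONE 'America/Los_Angeles')::date AS day_pacific
FROM  (SELECT timestamptz '2018-01-01 03:00:00+00' AS ts) t;
-- day_utc = 2018-01-01, day_pacific = 2017-12-31 (03:00 UTC is 19:00 the previous evening at UTC-8)
```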

Format mmdd any way you like for display.
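One possible way to turn the integer back into 'MM-DD' text for display, shown as a sketch on a few literal sample values rather than on the real table:

```sql
SELECT mmdd
     , to_char(mmdd / 100, 'FM00') || '-' || to_char(mmdd % 100, 'FM00') AS mm_dd
FROM  (VALUES (101), (229), (1222)) v(mmdd);
-- 101 -> 01-01, 229 -> 02-29, 1222 -> 12-22
```

Integer division by 100 recovers the month and the remainder recovers the day. The FM prefix suppresses the leading blank that to_char() otherwise reserves for a sign.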

The cast to integer is optional for the purpose of this particular query. But since you plan to do all kinds of queries, you'll end up wanting an index on the expression:

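-- time is timestamptz, so pin the time zone to keep the index expression IMMUTABLE;
-- queries must repeat exactly this expression to use the index
CREATE INDEX ltg_data_mmdd_idx ON ltg_data (f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date));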

(Not needed for this query.)
integer is a bit faster for this purpose. And you need the (otherwise optional) function wrapper for this, since to_char() is only defined STABLE, but we need IMMUTABLE for the index. The updated expression (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int is IMMUTABLE by itself, but the function wrapper is still convenient.
