如何在时间范围内为多个分组添加缺失数据? [英] How to include missing data for multiple groupings within the time span?
问题描述
我有以下参考查询,其中研究组由过去12个月(包括当月)的教师,学习年份和房间计算。我得到的结果是正确的,但是,我希望在数据丢失时包含零计数行。
我查看了其他几个相关帖子,但无法获得期望的输出:
-
Postgresql group month wise with missing values - 在Rails + Postgres中按任意时间间隔计数记录的最佳方式
以下是查询:
SELECT
upper(trim(t.full_name))AS teacher
, date_trunc('month',s.study_dt):: date AS study_month
,r.room_code AS房间
,COUNT(1)AS study_count
FROM
研究AS
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc .teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month'AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
,study_month
,房间
ORDER BY
老师
,study_month
,房间;
我得到的输出:
<$ p $ 老师,study_month,房间,study_count
DOE,JOHN,2015-07-01,A1,1
DOE,JOHN,2015-12-01,A2,1
DOE,JOHN,2016-01-01,B1,1
SIMPSON,HOMER ,2016-05-01,B2,3
MOUSE,MICKEY,2015-08-01,A2,1
MOUSE,MICKEY,2015 -11-01,B1,1
MOUSE,MICKEY,2015-11-01,B2,2
但我希望计数为0以显示所有缺失的年份和房间组合。例如(只有第一行,共有4个房间: A1 , A2 , B1 , B2 ) :
老师,study_month,房间,study_count
DOE,JOHN, 2015-07-01,A1,1
DOE,JOHN,2015-07-01,A2,0
DOE,JOHN,2015-07 -01,B1,0
DOE,JOHN,2015-07-01,B2,0
...
DOE,JOHN, 2015-12-01,A1,1
DOE,JOHN,2015-12-01,A2,0
DOE,JOHN,2015-12- 01,B1,0
DOE,JOHN,2015-12-01,B2,0
...
为了获得缺失的年份,我尝试使用时间序列和加入 time_range.year_month = study_month
,但它不起作用。
SELECT date_trunc('month',time_range):: date AS year_month
FROM generate_series(now() - interval '13 month',now(),'1 month')AS time_range
所以,我想知道如何填补
a)两年的差距 - 月和房间,作为奖金:
b)只是一个月。
原因是数据集将被馈送到一个数据透视表图书馆,我们可以得到类似于以下的输出(不能直接在PG中做到这一点):
teacher,room,2015 -07,...,2015-12,...,2016-07,总
DOE,JOHN,A1,1,...,1,...,0,2
DOE,JOHN,A2,0,...,0,...,0,0
...等等...
基于一些假设(问题中的含糊之处),我建议:
SELECT upper(trim(t.full_name))AS teacher
,m.study_month
,r.room_code AS room
,count( s.room_id)AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month',now() - interval '12 month') - 12!
,date_trunc('month',now())
,interval'1 month')m(study_month)
CROSS JOIN房间r
LEFT JOIN( - 圆括号!
学习s
JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id - INNER JOIN!
)ON tc.teacher_id = t.id
AND s.study_dt> = m .study_month
AND s.study_dt< m.study_month + interval'1 month' - sargable!
AND s.room_id = r.id
GROUP BY t.id,m.study_month,r.id - id是相应表格的PK
ORDER BY t.id,m.study_month ,r.id;
要点
-
使用
CROSS JOIN
构建所有期望组合的网格。然后将LEFT JOIN
添加到现有行。相关:
-
在你的情况中,它是几个表的连接,所以我在
FROM
列表中使用括号至LEFT JOIN
到圆括号内的INNER JOIN
的结果。
这将是 不正确的 到LEFT JOIN
分开给每个表,因为您会在部分匹配并获得可能不正确的计数。假设引用完整性并直接使用PK列,我们不需要包含房间
和老师
在第二次左侧。但我们仍然有两个表(studies 和
teacher_contacts
)的连接。teacher_contacts
的角色对我来说还不清楚。通常情况下,我希望直接在学习
和老师
之间建立关系。可能会进一步简化...
我们需要计算左侧的非空列以获得所需的计数。像
count(s.room_id)
为了保持大表的快速性,请确保谓词是 sargable 。并且添加匹配的索引。
< ) 独特。使用唯一的ID进行操作,最好是PK(更快,更简单)。我仍然在使用
teacher
作为输出以匹配您想要的结果。包含一个唯一的ID可能是明智的,因为名称可能是重复的。 你想:
<在过去的12个月(包括本月),
因此,从 date_trunc('month',now() - interval '12 month'
(不是13)。这就是已经开始的四舍五入,做了你想要的 - 比你原来的查询更准确。
$ b
由于您提到性能低下,根据实际的表定义和数据分布, 先聚合然后再加入 可能更快,就像在这个相关答案中一样:
SELECT上(trim(t.full_name))AS teacher
,m.mon AS study_month
,r.room_code AS room
,COALESCE(s.ct,0)AS study_count
FROM teacher t
CROSS JOIN generate_series(date_trunc('month',now() - interval '12 month') - 12!
,date_trunc('month',now())
,interval'1 month')mon
CROSS JOIN房间r
LEFT JOIN( - 圆括号!
SELECT tc.teacher_id,date_trunc('month',s.study_dt)AS mon,s.room_id,count(*)AS ct
FROM studies s
JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id
WHERE s.study_dt> = date_trunc('month',now() - interval '12 month') - sargable
GROUP BY 1,2,3
)s ON s。 teacher_id = t.id
AND s.mon = m.mon
AND s.room_id = r.id
ORDER BY 1,2,3;
关于您的结束语:
数据集将被输入到一个数据透视表...(不能直接在PG中完成)
您可能 使用的双参数形式crosstab()
直接产生你想要的结果,并具有优良的性能,上面的查询不需要开始。考虑:
I have below referenced query which groups studies counts by teacher, study year-month, and room for the past 12 months (including current month). The result I get is correct, however, I would like to include rows with zero counts for when the data is missing.
I looked at several other related posts but could not get desired output:
- Postgres - how to return rows with 0 count for missing data?
- Postgresql group month wise with missing values
- Best way to count records by arbitrary time intervals in Rails+Postgres
Here is the query:
SELECT
upper(trim(t.full_name)) AS teacher
, date_trunc('month', s.study_dt)::date AS study_month
, r.room_code AS room
, COUNT(1) AS study_count
FROM
studies AS s
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc.teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month' AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
, study_month
, room
ORDER BY
teacher
, study_month
, room;
The output I get:
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2
But I want count of 0 to show for all missing year-month and room combinations. For example (just first rows, there are 4 rooms in all: A1, A2, B1, B2):
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...
To get the missing year-months, I tried left outer join on using time series and joining on time_range.year_month = study_month
, but it didn't work.
SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range
So, I'd like to know how to 'fill in the gaps' for
a) both year-month and room and, as a bonus: b) just a year-month.
The reason for this is that the dataset would be fed to a pivot library to that we can get an output similar to following (could not do this in PG directly):
teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...
Based on some assumptions (ambiguities in the question) I suggest:
SELECT upper(trim(t.full_name)) AS teacher
, m.study_month
, r.room_code AS room
, count(s.room_id) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') m(study_month)
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
studies s
JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id -- INNER JOIN!
) ON tc.teacher_id = t.id
AND s.study_dt >= m.study_month
AND s.study_dt < m.study_month + interval '1 month' -- sargable!
AND s.room_id = r.id
GROUP BY t.id, m.study_month, r.id -- id is PK of respective tables
ORDER BY t.id, m.study_month, r.id;
Major points
Build a grid of all desired combinations with
CROSS JOIN
. And thenLEFT JOIN
to existing rows. Related:In your case, it's a join of several tables, so I use parentheses in the
FROM
list toLEFT JOIN
to the result ofINNER JOIN
within the parentheses. It would be incorrect toLEFT JOIN
to each table separately, because you would include hits on partial matches and get potentially incorrect counts.Assuming referential integrity and working with PK columns directly, we don't need to include
rooms
andteachers
on the left side a second time. But we still have a join of two tables (studies
andteacher_contacts
). The role ofteacher_contacts
is unclear to me. Normally, I would expect a relationship betweenstudies
andteachers
directly. Might be further simplified ...We need to count a non-null column on the left side to get the desired counts. Like
count(s.room_id)
To keep this fast for big tables, make sure your predicates are sargable. And add matching indexes.
The column
teacher
is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still usingteacher
for the output to match your desired result. It might be wise to include a unique ID, since names can be duplicates.You want:
the past 12 months (including current month).
So start with
date_trunc('month', now() - interval '12 month'
(not 13). That's rounding down the start already and does what you want - more accurately than your original query.
Since you mentioned slow performance, depending on actual table definitions and data distribution, it's probably faster to aggregate first and join later, like in this related answer:
SELECT upper(trim(t.full_name)) AS teacher
, m.mon AS study_month
, r.room_code AS room
, COALESCE(s.ct, 0) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') mon
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
FROM studies s
JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id
WHERE s.study_dt >= date_trunc('month', now() - interval '12 month') -- sargable
GROUP BY 1, 2, 3
) s ON s.teacher_id = t.id
AND s.mon = m.mon
AND s.room_id = r.id
ORDER BY 1, 2, 3;
About your closing remark:
the dataset would be fed to a pivot library ... (could not do this in PG directly)
Chances are you can use the two-parameter form of crosstab()
to produce your desired result directly and with excellent performance and the above query is not needed to begin with. Consider:
这篇关于如何在时间范围内为多个分组添加缺失数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!