如何在时间范围内为多个分组添加缺失数据？ [英] How to include missing data for multiple groupings within the time span?

查看：105 发布时间：2018/5/30 14:15:41 sql postgresql group-by left-join aggregate-functions

本文介绍了如何在时间范围内为多个分组添加缺失数据？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有以下参考查询，其中研究组由过去12个月（包括当月）的教师，学习年份和房间计算。我得到的结果是正确的，但是，我希望在数据丢失时包含零计数行。

我查看了其他几个相关帖子，但无法获得期望的输出：

Postgresql group month wise with missing values

在Rails + Postgres中按任意时间间隔计数记录的最佳方式

以下是查询：

  SELECT 
 upper（trim（t.full_name））AS teacher 
， date_trunc（'month'，s.study_dt）:: date AS study_month 
，r.room_code AS房间
，COUNT（1）AS study_count 
 FROM 
研究AS 
 LEFT OUTER JOIN rooms AS r ON r.id = s.room_id 
 LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id 
 LEFT OUTER JOIN teachers AS t ON t.id = tc .teacher_id 
 WHERE 
 s.study_dt BETWEEN now（） -  interval '13 month'AND now（）
 AND s.study_dt IS NOT NULL 
 GROUP BY 
 teacher 
，study_month 
，房间
 ORDER BY 
老师
，study_month 
，房间;

我得到的输出：

<$ p $ 老师，study_month，房间，study_count DOE，JOHN，2015-07-01，A1，1 DOE，JOHN，2015-12-01，A2，1 DOE，JOHN，2016-01-01，B1，1 SIMPSON，HOMER ，2016-05-01，B2，3 MOUSE，MICKEY，2015-08-01，A2，1 MOUSE，MICKEY，2015 -11-01，B1，1 MOUSE，MICKEY，2015-11-01，B2，2

但我希望计数为0以显示所有缺失的年份和房间组合。例如（只有第一行，共有4个房间： A1 ， A2 ， B1 ， B2 ）：

 老师，study_month，房间，study_count
DOE，JOHN， 2015-07-01，A1，1 
DOE，JOHN，2015-07-01，A2，0 
DOE，JOHN，2015-07 -01，B1，0 
DOE，JOHN，2015-07-01，B2，0 
 ... 
DOE，JOHN， 2015-12-01，A1，1 
DOE，JOHN，2015-12-01，A2，0 
DOE，JOHN，2015-12- 01，B1，0 
DOE，JOHN，2015-12-01，B2，0 
 ...

为了获得缺失的年份，我尝试使用时间序列和加入 time_range.year_month = study_month ，但它不起作用。

  SELECT date_trunc（'month'，time_range）:: date AS year_month 
 FROM generate_series（now（） -  interval '13 month'，now（），'1 month'）AS time_range

所以，我想知道如何填补

a）两年的差距 - 月和房间，作为奖金： b）只是一个月。

 
 
 原因是数据集将被馈送到一个数据透视表图书馆，我们可以得到类似于以下的输出（不能直接在PG中做到这一点）：
  teacher，room，2015 -07，...，2015-12，...，2016-07，总
DOE，JOHN，A1,1，...，1，...，0,2 
DOE，JOHN，A2,0，...，0，...，0,0 
 ...等等... 
  
 
解决方案
基于一些假设（问题中的含糊之处），我建议：
 
 
  SELECT upper（trim（t.full_name））AS teacher 
，m.study_month 
，r.room_code AS room 
，count（ s.room_id）AS study_count 
 
 FROM teachers t 
 CROSS JOIN generate_series（date_trunc（'month'，now（） -  interval '12 month'） -  12！ 
，date_trunc（'month'，now（））
，interval'1 month'）m（study_month）
 CROSS JOIN房间r 
 LEFT JOIN（ - 圆括号！
学习s 
 JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id  -  INNER JOIN！
）ON tc.teacher_id = t.id 
 AND s.study_dt> = m .study_month 
 AND s.study_dt< m.study_month + interval'1 month' -  sargable！ 
 AND s.room_id = r.id 
 GROUP BY t.id，m.study_month，r.id  -  id是相应表格的PK 
 ORDER BY t.id，m.study_month ，r.id; 
  
 
 
 
要点

 
  使用 CROSS JOIN 构建所有期望组合的网格。然后将 LEFT JOIN 添加到现有行。相关： 
 
  
   array_agg group by和null  
 
   
 
 
      
  在你的情况中，它是几个表的连接，所以我在 FROM 列表中使用括号至 LEFT JOIN 到圆括号内的 INNER JOIN 的结果。 
这将是 不正确的 到 LEFT JOIN 分开给每个表，因为您会在部分匹配并获得可能不正确的计数。假设引用完整性并直接使用PK列，我们不需要包含房间和老师在第二次左侧。但我们仍然有两个表（ studies 和 teacher_contacts ）的连接。  teacher_contacts 的角色对我来说还不清楚。通常情况下，我希望直接在学习和老师之间建立关系。可能会进一步简化... 
 
 我们需要计算左侧的非空列以获得所需的计数。像 count（s.room_id） 
 
 
  为了保持大表的快速性，请确保谓词是  sargable  。并且添加匹配的索引。
  
   < ） 独特。使用唯一的ID进行操作，最好是PK（更快，更简单）。我仍然在使用 teacher 作为输出以匹配您想要的结果。包含一个唯一的ID可能是明智的，因为名称可能是重复的。
   你想： 
 
 <在过去的12个月（包括本月），
 
 
 
 因此，从 date_trunc（'month'，now（） -  interval '12 month'（不是13）。这就是已经开始的四舍五入，做了你想要的 - 比你原来的查询更准确。
  
 
 
 
 
 $ b 由于您提到性能低下，根据实际的表定义和数据分布，  先聚合然后再加入 可能更快，就像在这个相关答案中一样： 
 
 
 
   Postgres  - 如何返回行数为0的数据丢失数据？ 
 
 
 
 
  
 
 
 
  SELECT上（trim（t.full_name））AS teacher 
，m.mon AS study_month 
，r.room_code AS room 
，COALESCE（s.ct，0）AS study_count 
 
 FROM teacher t 
 CROSS JOIN generate_series（date_trunc（'month'，now（） -  interval '12 month'） -  12！ 
，date_trunc（'month'，now（））
，interval'1 month'）mon 
 CROSS JOIN房间r 
 LEFT JOIN（ - 圆括号！
 SELECT tc.teacher_id，date_trunc（'month'，s.study_dt）AS mon，s.room_id，count（*）AS ct 
 FROM studies s 
 JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id 
 WHERE s.study_dt> = date_trunc（'month'，now（） -  interval '12 month'） -  sargable 
 GROUP BY 1，2，3 
）s ON s。 teacher_id = t.id 
 AND s.mon = m.mon 
 AND s.room_id = r.id 
 ORDER BY 1，2，3; 
  
 
 
 
 
 
 关于您的结束语：
 
 数据集将被输入到一个数据透视表...（不能直接在PG中完成）
 
 
您可能 使用的双参数形式crosstab（）直接产生你想要的结果，并具有优良的性能，上面的查询不需要开始。考虑： 
 
 
 
   PostgreSQL交叉表查询 
 
 
 
I have below referenced query which groups studies counts by teacher, study year-month, and room for the past 12 months (including current month). The result I get is correct, however, I would like to include rows with zero counts for when the data is missing.

I looked at several other related posts but could not get desired output:


Postgres - how to return rows with 0 count for missing data?
Postgresql group month wise with missing values
Best way to count records by arbitrary time intervals in Rails+Postgres


Here is the query:
SELECT
    upper(trim(t.full_name))               AS teacher
  , date_trunc('month', s.study_dt)::date  AS study_month
  , r.room_code                            AS room
  , COUNT(1)                               AS study_count
FROM
  studies                           AS s
  LEFT OUTER JOIN rooms             AS r   ON r.id = s.room_id
  LEFT OUTER JOIN teacher_contacts  AS tc  ON tc.id = s.teacher_contact_id
  LEFT OUTER JOIN teachers          AS t   ON t.id = tc.teacher_id 
WHERE
  s.study_dt BETWEEN now() - interval '13 month' AND now()
  AND s.study_dt IS NOT NULL
GROUP BY
    teacher
  , study_month
  , room
ORDER BY 
    teacher  
  , study_month
  , room;
The output I get:
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2
But I want count of 0 to show for all missing year-month and room combinations. For example (just first rows, there are 4 rooms in all: A1, A2, B1, B2):
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...
To get the missing year-months, I tried left outer join on using time series and joining on time_range.year_month = study_month, but it didn't work.
SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range 
So, I'd like to know how to 'fill in the gaps' for

a) both year-month and room and, as a bonus:
 b) just a year-month.

The reason for this is that the dataset would be fed to a pivot library to that we can get an output similar to following (could not do this in PG directly):
teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...

 解决方案 
Based on some assumptions (ambiguities in the question) I suggest:
SELECT upper(trim(t.full_name)) AS teacher
     , m.study_month
     , r.room_code              AS room
     , count(s.room_id)         AS study_count

FROM   teachers t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                          , date_trunc('month', now())
                          , interval '1 month') m(study_month)
CROSS  JOIN rooms r
LEFT   JOIN (                                                  -- parentheses!
          studies s
   JOIN   teacher_contacts tc ON tc.id = s.teacher_contact_id  -- INNER JOIN!
   ) ON tc.teacher_id = t.id
    AND s.study_dt >= m.study_month
    AND s.study_dt <  m.study_month + interval '1 month'      -- sargable!
    AND s.room_id = r.id
GROUP  BY t.id, m.study_month, r.id  -- id is PK of respective tables
ORDER  BY t.id, m.study_month, r.id;


Major points


Build a grid of all desired combinations with CROSS JOIN. And then LEFT JOIN to existing rows. Related:


array_agg group by and null
Get created as well as deleted entries of last week

In your case, it's a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of INNER JOIN within the parentheses.
It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and get potentially incorrect counts.
Assuming referential integrity and working with PK columns directly, we don't need to include rooms and teachers on the left side a second time. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a relationship between studies and teachers directly. Might be further simplified ...
We need to count a non-null column on the left side to get the desired counts. Like count(s.room_id)
To keep this fast for big tables, make sure your predicates are sargable. And add matching indexes.
The column teacher is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher for the output to match your desired result. It might be wise to include a unique ID, since names can be duplicates.
You want:

  the past 12 months (including current month).
So start with date_trunc('month', now() - interval '12 month' (not 13). That's rounding down the start already and does what you want - more accurately than your original query.




Since you mentioned slow performance, depending on actual table definitions and data distribution, it's probably faster to aggregate first and join later, like in this related answer:


Postgres - how to return rows with 0 count for missing data?




SELECT upper(trim(t.full_name)) AS teacher
     , m.mon                    AS study_month
     , r.room_code              AS room
     , COALESCE(s.ct, 0)        AS study_count

FROM   teachers t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                          , date_trunc('month', now())
                          , interval '1 month') mon
CROSS  JOIN rooms r
LEFT   JOIN (                                                  -- parentheses!
   SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
   FROM   studies s
   JOIN   teacher_contacts tc ON s.teacher_contact_id = tc.id
   WHERE  s.study_dt >= date_trunc('month', now() - interval '12 month')  -- sargable
   GROUP  BY 1, 2, 3
   ) s ON s.teacher_id = t.id
      AND s.mon = m.mon
      AND s.room_id = r.id
ORDER  BY 1, 2, 3;




About your closing remark:

  the dataset would be fed to a pivot library ... (could not do this in PG directly)
Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance and the above query is not needed to begin with. Consider:


PostgreSQL Crosstab Query


                        这篇关于如何在时间范围内为多个分组添加缺失数据？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

Major points

如何在时间范围内为多个分组添加缺失数据？ [英] How to include missing data for multiple groupings within the time span?

问题描述

要点

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何在时间范围内为多个分组添加缺失数据？ [英] How to include missing data for multiple groupings within the time span?

问题描述

要点

Major points

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭