如何在时间范围内为多个分组添加缺失数据? [英] How to include missing data for multiple groupings within the time span?

查看:105
本文介绍了如何在时间范围内为多个分组添加缺失数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下参考查询,其中研究组由过去12个月(包括当月)的教师,学习年份和房间计算。我得到的结果是正确的,但是,我希望在数据丢失时包含零计数行。



我查看了其他几个相关帖子,但无法获得期望的输出:





以下是查询:

  SELECT 
upper(trim(t.full_name))AS teacher
, date_trunc('month',s.study_dt):: date AS study_month
,r.room_code AS房间
,COUNT(1)AS study_count
FROM
研究AS
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc .teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month'AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
,study_month
,房间
ORDER BY
老师
,study_month
,房间;

我得到的输出:

<$ p $ 老师,study_month,房间,study_count
DOE,JOHN,2015-07-01,A1,1
DOE,JOHN,2015-12-01,A2,1
DOE,JOHN,2016-01-01,B1,1
SIMPSON,HOMER ,2016-05-01,B2,3
MOUSE,MICKEY,2015-08-01,A2,1
MOUSE,MICKEY,2015 -11-01,B1,1
MOUSE,MICKEY,2015-11-01,B2,2

但我希望计数为0以显示所有缺失的年份和房间组合。例如(只有第一行,共有4个房间: A1 A2 B1 B2 ) :

 老师,study_month,房间,study_count
DOE,JOHN, 2015-07-01,A1,1
DOE,JOHN,2015-07-01,A2,0
DOE,JOHN,2015-07 -01,B1,0
DOE,JOHN,2015-07-01,B2,0
...
DOE,JOHN, 2015-12-01,A1,1
DOE,JOHN,2015-12-01,A2,0
DOE,JOHN,2015-12- 01,B1,0
DOE,JOHN,2015-12-01,B2,0
...

为了获得缺失的年份,我尝试使用时间序列和加入 time_range.year_month = study_month ,但它不起作用。

  SELECT date_trunc('month',time_range):: date AS year_month 
FROM generate_series(now() - interval '13 month',now(),'1 month')AS time_range

所以,我想知道如何填补

a)两年的差距 - 月和房间,作为奖金:
b)只是一个月。



原因是数据集将被馈送到一个数据透视表图书馆,我们可以得到类似于以下的输出(不能直接在PG中做到这一点):

  teacher,room,2015 -07,...,2015-12,...,2016-07,总
DOE,JOHN,A1,1,...,1,...,0,2
DOE,JOHN,A2,0,...,0,...,0,0
...等等...

解决方案

基于一些假设(问题中的含糊之处),我建议:

  SELECT upper(trim(t.full_name))AS teacher 
,m.study_month
,r.room_code AS room
,count( s.room_id)AS study_count

FROM teachers t
CROSS JOIN generate_series(date_trunc('month',now() - interval '12 month') - 12!
,date_trunc('month',now())
,interval'1 month')m(study_month)
CROSS JOIN房间r
LEFT JOIN( - 圆括号!
学习s
JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id - INNER JOIN!
)ON tc.teacher_id = t.id
AND s.study_dt> = m .study_month
AND s.study_dt< m.study_month + interval'1 month' - sargable!
AND s.room_id = r.id
GROUP BY t.id,m.study_month,r.id - id是相应表格的PK
ORDER BY t.id,m.study_month ,r.id;



要点




  • 使用 CROSS JOIN 构建所有期望组合的网格。然后将 LEFT JOIN 添加到现有行。相关:


  • 在你的情况中,它是几个表的连接,所以我在 FROM 列表中使用括号至 LEFT JOIN 到圆括号内的 INNER JOIN 的结果。
    这将是 不正确的 LEFT JOIN 分开给每个表,因为您会在部分匹配并获得可能不正确的计数。假设引用完整性并直接使用PK列,我们不需要包含房间老师在第二次左侧。但我们仍然有两个表( studies 和 teacher_contacts )的连接。 teacher_contacts 的角色对我来说还不清楚。通常情况下,我希望直接在学习老师之间建立关系。可能会进一步简化...

  • 我们需要计算左侧的非空列以获得所需的计数。像 count(s.room_id)


  • 为了保持大表的快速性,请确保谓词是 sargable 。并且添加匹配的索引


  • < ) 独特。使用唯一的ID进行操作,最好是PK(更快,更简单)。我仍然在使用 teacher 作为输出以匹配您想要的结果。包含一个唯一的ID可能是明智的,因为名称可能是重复的。

  • 你想:

    <在过去的12个月(包括本月),



    因此,从 date_trunc('month',now() - interval '12 month'(不是13)。这就是已经开始的四舍五入,做了你想要的 - 比你原来的查询更准确。






    $ b

    由于您提到性能低下,根据实际的表定义和数据分布, 先聚合然后再加入 可能更快,就像在这个相关答案中一样:





      SELECT上(trim(t.full_name))AS teacher 
    ,m.mon AS study_month
    ,r.room_code AS room
    ,COALESCE(s.ct,0)AS study_count

    FROM teacher t
    CROSS JOIN generate_series(date_trunc('month',now() - interval '12 month') - 12!
    ,date_trunc('month',now())
    ,interval'1 month')mon
    CROSS JOIN房间r
    LEFT JOIN( - 圆括号!
    SELECT tc.teacher_id,date_trunc('month',s.study_dt)AS mon,s.room_id,count(*)AS ct
    FROM studies s
    JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id
    WHERE s.study_dt> = date_trunc('month',now() - interval '12 month') - sargable
    GROUP BY 1,2,3
    )s ON s。 teacher_id = t.id
    AND s.mon = m.mon
    AND s.room_id = r.id
    ORDER BY 1,2,3;






    关于您的结束语:


    数据集将被输入到一个数据透视表...(不能直接在PG中完成)


    您可能 使用的双参数形式crosstab()直接产生你想要的结果,并具有优良的性能,上面的查询不需要开始。考虑:


    I have below referenced query which groups studies counts by teacher, study year-month, and room for the past 12 months (including current month). The result I get is correct, however, I would like to include rows with zero counts for when the data is missing.

    I looked at several other related posts but could not get desired output:

    Here is the query:

    SELECT
        upper(trim(t.full_name))               AS teacher
      , date_trunc('month', s.study_dt)::date  AS study_month
      , r.room_code                            AS room
      , COUNT(1)                               AS study_count
    FROM
      studies                           AS s
      LEFT OUTER JOIN rooms             AS r   ON r.id = s.room_id
      LEFT OUTER JOIN teacher_contacts  AS tc  ON tc.id = s.teacher_contact_id
      LEFT OUTER JOIN teachers          AS t   ON t.id = tc.teacher_id 
    WHERE
      s.study_dt BETWEEN now() - interval '13 month' AND now()
      AND s.study_dt IS NOT NULL
    GROUP BY
        teacher
      , study_month
      , room
    ORDER BY 
        teacher  
      , study_month
      , room;
    

    The output I get:

    "teacher","study_month","room","study_count"
    "DOE, JOHN","2015-07-01","A1",1
    "DOE, JOHN","2015-12-01","A2",1
    "DOE, JOHN","2016-01-01","B1",1
    "SIMPSON, HOMER","2016-05-01","B2",3
    "MOUSE, MICKEY","2015-08-01","A2",1
    "MOUSE, MICKEY","2015-11-01","B1",1
    "MOUSE, MICKEY","2015-11-01","B2",2
    

    But I want count of 0 to show for all missing year-month and room combinations. For example (just first rows, there are 4 rooms in all: A1, A2, B1, B2):

    "teacher","study_month","room","study_count"
    "DOE, JOHN","2015-07-01","A1",1
    "DOE, JOHN","2015-07-01","A2",0
    "DOE, JOHN","2015-07-01","B1",0
    "DOE, JOHN","2015-07-01","B2",0
    ...
    "DOE, JOHN","2015-12-01","A1",1
    "DOE, JOHN","2015-12-01","A2",0
    "DOE, JOHN","2015-12-01","B1",0
    "DOE, JOHN","2015-12-01","B2",0
    ...
    

    To get the missing year-months, I tried left outer join on using time series and joining on time_range.year_month = study_month, but it didn't work.

    SELECT date_trunc('month', time_range)::date AS year_month
    FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range 
    

    So, I'd like to know how to 'fill in the gaps' for

    a) both year-month and room and, as a bonus: b) just a year-month.

    The reason for this is that the dataset would be fed to a pivot library to that we can get an output similar to following (could not do this in PG directly):

    teacher,room,2015-07,...,2015-12,...,2016-07,total
    "DOE, JOHN",A1,1,...,1,...,0,2
    "DOE, JOHN",A2,0,...,0,...,0,0
    ...and so on...
    

    解决方案

    Based on some assumptions (ambiguities in the question) I suggest:

    SELECT upper(trim(t.full_name)) AS teacher
         , m.study_month
         , r.room_code              AS room
         , count(s.room_id)         AS study_count
    
    FROM   teachers t
    CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                              , date_trunc('month', now())
                              , interval '1 month') m(study_month)
    CROSS  JOIN rooms r
    LEFT   JOIN (                                                  -- parentheses!
              studies s
       JOIN   teacher_contacts tc ON tc.id = s.teacher_contact_id  -- INNER JOIN!
       ) ON tc.teacher_id = t.id
        AND s.study_dt >= m.study_month
        AND s.study_dt <  m.study_month + interval '1 month'      -- sargable!
        AND s.room_id = r.id
    GROUP  BY t.id, m.study_month, r.id  -- id is PK of respective tables
    ORDER  BY t.id, m.study_month, r.id;
    

    Major points

    • Build a grid of all desired combinations with CROSS JOIN. And then LEFT JOIN to existing rows. Related:

    • In your case, it's a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of INNER JOIN within the parentheses. It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and get potentially incorrect counts.

    • Assuming referential integrity and working with PK columns directly, we don't need to include rooms and teachers on the left side a second time. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a relationship between studies and teachers directly. Might be further simplified ...

    • We need to count a non-null column on the left side to get the desired counts. Like count(s.room_id)

    • To keep this fast for big tables, make sure your predicates are sargable. And add matching indexes.

    • The column teacher is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher for the output to match your desired result. It might be wise to include a unique ID, since names can be duplicates.

    • You want:

      the past 12 months (including current month).

      So start with date_trunc('month', now() - interval '12 month' (not 13). That's rounding down the start already and does what you want - more accurately than your original query.


    Since you mentioned slow performance, depending on actual table definitions and data distribution, it's probably faster to aggregate first and join later, like in this related answer:

    SELECT upper(trim(t.full_name)) AS teacher
         , m.mon                    AS study_month
         , r.room_code              AS room
         , COALESCE(s.ct, 0)        AS study_count
    
    FROM   teachers t
    CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                              , date_trunc('month', now())
                              , interval '1 month') mon
    CROSS  JOIN rooms r
    LEFT   JOIN (                                                  -- parentheses!
       SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
       FROM   studies s
       JOIN   teacher_contacts tc ON s.teacher_contact_id = tc.id
       WHERE  s.study_dt >= date_trunc('month', now() - interval '12 month')  -- sargable
       GROUP  BY 1, 2, 3
       ) s ON s.teacher_id = t.id
          AND s.mon = m.mon
          AND s.room_id = r.id
    ORDER  BY 1, 2, 3;
    


    About your closing remark:

    the dataset would be fed to a pivot library ... (could not do this in PG directly)

    Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance and the above query is not needed to begin with. Consider:

    这篇关于如何在时间范围内为多个分组添加缺失数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆