How to use a window function to determine when to perform different tasks in Hive or Postgres?


Problem description


    I am new to SQL and need to be able to solve the following problem in both Hive and Postgres.

    Data

    I have some data showing the start day and end day of different pre-prioritised tasks per person:

       person      task_key start_day end_day
    1    Kate             A         1       5
    2    Kate             B         1       5
    3    Adam             A         1       5
    4    Adam             B         2       5
    5     Eve             A         2       5
    6     Eve             B         1       5
    7   Jason             A         1       5
    8   Jason             B         4       5
    9   Jason             C         3       5
    10  Jason             D         5       5
    11  Jason             E         4       5
    

    NOTE: Task key is ordered so that higher letters have higher priorities.

    Question

    I need to work out which task each person should be working on each day, with the condition that:

    1. Higher lettered tasks take priority over lower lettered tasks.
    2. If a higher lettered task overlaps any part of a lower lettered task, then the lower lettered task gets set to NA (to represent that the person should not work on it ever).

    Simplification

    In the real data the end_day is always 5 in the original table, i.e. only the start_day varies but the end_day is constant. This means my desired output will have the same number of rows as my original table :)

    Output

    This is the sort of output I need (Jason is more representative of the data I have which can be over 100 tasks covering a period of 90 days):

       person    task_key start_day end_day valid_from valid_to
    1    Kate           A         1       5         NA       NA
    2    Kate           B         1       5          1        5
    3    Adam           A         1       5          1        2
    4    Adam           B         2       5          2        5
    5     Eve           A         2       5         NA       NA
    6     Eve           B         1       5          1        5
    7   Jason           A         1       5          1        3
    8   Jason           B         4       5         NA       NA
    9   Jason           C         3       5          3        4
    10  Jason           D         5       5         NA       NA
    11  Jason           E         4       5          4        5
    

    Thank you for your time in advance.

    P.S. Similar question I have asked but in R: How to use a window function to determine when to perform different tasks?

    Solution

    The solution in Postgres is fairly easy, because it supports generate_series(). First, explode the data into one row per day for each row in your table:

    select d.*, gs.dy
    from data d, lateral
         generate_series(start_day, end_day) gs(dy);
    

    Then, aggregate to get the task for each day:

    select d.person, d.dy, max(d.task_key) as task_key
    from (select d.*, gs.dy
          from data d, lateral
               generate_series(start_day, end_day) gs(dy)
         ) d
    group by d.person, d.dy;
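
To try the two steps above without a Postgres server, here is a minimal sketch that runs the same explode-then-aggregate idea in SQLite through Python's standard `sqlite3` module. It is a stand-in, not the Postgres query itself: since `generate_series()` and `lateral` may not be available there, a recursive CTE named `numbers` (a hypothetical name) supplies the day numbers. The `data` table is loaded with the sample rows from the question, and `task_per_day` is an illustrative helper name.

```python
import sqlite3

# Sample rows from the question: (person, task_key, start_day, end_day).
ROWS = [
    ("Kate", "A", 1, 5), ("Kate", "B", 1, 5),
    ("Adam", "A", 1, 5), ("Adam", "B", 2, 5),
    ("Eve", "A", 2, 5), ("Eve", "B", 1, 5),
    ("Jason", "A", 1, 5), ("Jason", "B", 4, 5), ("Jason", "C", 3, 5),
    ("Jason", "D", 5, 5), ("Jason", "E", 4, 5),
]

# Assumption: a recursive CTE stands in for generate_series().
PER_DAY_SQL = """
with recursive numbers(n) as (
    select 1
    union all
    select n + 1 from numbers where n < (select max(end_day) from data)
)
select d.person, n.n as dy, max(d.task_key) as task_key
from data d
join numbers n on n.n between d.start_day and d.end_day  -- one row per active day
group by d.person, n.n                                   -- highest letter wins the day
order by d.person, n.n
"""


def task_per_day():
    """Return (person, day, task_key) rows, one per person per day."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table data (person text, task_key text, start_day int, end_day int)"
    )
    con.executemany("insert into data values (?, ?, ?, ?)", ROWS)
    rows = con.execute(PER_DAY_SQL).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in task_per_day():
        print(row)
```

Taking `max(task_key)` relies on the keys sorting in priority order as plain strings, which holds for the single letters in the sample but is an assumption for real task keys.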
    

    You can then re-aggregate, but this is tricky because you might have "split" the original rows (see my comment). This answers your question about which task to perform on which day.
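
One possible way to do the re-aggregation is sketched below, again in SQLite via Python's `sqlite3` as a runnable stand-in. It assumes the question's simplification that end_day is constant, in which case the days won by a task are contiguous and no row gets split, so `min(dy)` and `max(dy)` per task are enough. The CTE names `numbers`, `per_day` and `ranges` and the helper `valid_ranges` are hypothetical.

```python
import sqlite3

# Sample rows from the question: (person, task_key, start_day, end_day).
ROWS = [
    ("Kate", "A", 1, 5), ("Kate", "B", 1, 5),
    ("Adam", "A", 1, 5), ("Adam", "B", 2, 5),
    ("Eve", "A", 2, 5), ("Eve", "B", 1, 5),
    ("Jason", "A", 1, 5), ("Jason", "B", 4, 5), ("Jason", "C", 3, 5),
    ("Jason", "D", 5, 5), ("Jason", "E", 4, 5),
]

RANGES_SQL = """
with recursive numbers(n) as (          -- day numbers 1..max(end_day)
    select 1
    union all
    select n + 1 from numbers where n < (select max(end_day) from data)
),
per_day as (                            -- winning task per person per day
    select d.person, n.n as dy, max(d.task_key) as task_key
    from data d
    join numbers n on n.n between d.start_day and d.end_day
    group by d.person, n.n
),
ranges as (                             -- first and last day each task wins
    select person, task_key, min(dy) as valid_from, max(dy) as valid_to
    from per_day
    group by person, task_key
)
select d.person, d.task_key, d.start_day, d.end_day, r.valid_from, r.valid_to
from data d
left join ranges r                      -- tasks that never win keep NULLs
  on r.person = d.person and r.task_key = d.task_key
order by d.person, d.task_key
"""


def valid_ranges():
    """Return one row per original task with valid_from/valid_to (None = never worked on)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table data (person text, task_key text, start_day int, end_day int)"
    )
    con.executemany("insert into data values (?, ?, ?, ?)", ROWS)
    rows = con.execute(RANGES_SQL).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in valid_ranges():
        print(row)
```

Note that valid_to here is the last day on which the task is actually the winner, so Jason's task A comes out as 1 to 2, whereas the question's desired output lists the day the next task starts (1 to 3). If end_day varied, a task could win two separate stretches of days and this simple min/max would no longer be correct.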

    You can do all of this without a lateral join or generate_series() by using a number/tally table.
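
Since the question asks specifically about window functions, here is a different technique from the answer above, offered as a sketch rather than the answer's method. Under the question's simplification (constant end_day), the per-day explosion can be skipped: for each task, a window `min(start_day)` over all higher-lettered tasks of the same person gives the earliest day a higher-priority task starts (called `next_start` here, a hypothetical name). A task is valid only if that day is later than its own start_day. The query uses only standard window syntax, so it should carry over to Postgres and Hive, but it is written against SQLite here (window functions need SQLite 3.25 or newer).

```python
import sqlite3

# Sample rows from the question: (person, task_key, start_day, end_day).
ROWS = [
    ("Kate", "A", 1, 5), ("Kate", "B", 1, 5),
    ("Adam", "A", 1, 5), ("Adam", "B", 2, 5),
    ("Eve", "A", 2, 5), ("Eve", "B", 1, 5),
    ("Jason", "A", 1, 5), ("Jason", "B", 4, 5), ("Jason", "C", 3, 5),
    ("Jason", "D", 5, 5), ("Jason", "E", 4, 5),
]

WINDOW_SQL = """
select person, task_key, start_day, end_day,
       case when next_start is null or next_start > start_day
            then start_day end                         as valid_from,
       case when next_start is null then end_day
            when next_start > start_day then next_start end as valid_to
from (
    select d.*,
           -- earliest start among strictly higher-lettered tasks of this person
           min(start_day) over (
               partition by person
               order by task_key desc
               rows between unbounded preceding and 1 preceding
           ) as next_start
    from data d
) t
order by person, task_key
"""


def window_ranges():
    """Return one row per task; valid_from/valid_to are None for suppressed tasks."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table data (person text, task_key text, start_day int, end_day int)"
    )
    con.executemany("insert into data values (?, ?, ?, ?)", ROWS)
    rows = con.execute(WINDOW_SQL).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in window_ranges():
        print(row)
```

On the sample data this follows the convention of the question's desired output, where valid_to is the day the next higher task takes over (Jason's A is 1 to 3, C is 3 to 4). It assumes task keys sort in priority order as strings and are unique per person.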
