How to use a window function to determine when to perform different tasks?


Problem description

Note: I have asked a similar question for SQL: How to use a window function to determine when to perform different tasks in Hive or Postgres?

Data

I have some data showing the start day and end day of different pre-prioritised tasks per person:

   input_df <- data.frame(person        = c(rep("Kate", 2), rep("Adam", 2), rep("Eve", 2), rep("Jason", 5)),
                       task_key   = c(c("A","B"), c("A","B"), c("A","B"), c("A","B","C","D","E")),
                       start_day     = c(c(1L,1L), c(1L,2L), c(2L,1L), c(1L,4L,3L,5L,4L)),
                       end_day       = 5L)




   person      task_key start_day end_day
1    Kate             A         1       5
2    Kate             B         1       5
3    Adam             A         1       5
4    Adam             B         2       5
5     Eve             A         2       5
6     Eve             B         1       5
7   Jason             A         1       5
8   Jason             B         4       5
9   Jason             C         3       5
10  Jason             D         5       5
11  Jason             E         4       5


NOTE: Task keys are ordered so that higher letters have higher priority.

Question

I need to work out which task each person should be working on each day, subject to these conditions:

  1. Higher lettered tasks take priority over lower lettered tasks.
  2. If a higher lettered task overlaps any part of a lower lettered task, then the lower lettered task gets set to NA (to represent that the person should never work on it).

Simplification

In the real data the end_day is always 5, i.e. only the start_day varies and the end_day is constant. This means my desired output will have the same number of rows as my original table :)
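
To illustrate why the constant end_day helps, here is a minimal base R sketch for Jason's rows only. It assumes, judging by the desired output in this question (for example, Adam's task A stays valid for days 1 to 2), that a task is dropped only when some higher-priority task starts on or before its own start day. The names `jason`, `next_start` and `is_valid` are made up for this illustration.

```r
# Jason's tasks from input_df, already sorted by priority (A lowest, E highest).
jason <- data.frame(task_key  = c("A", "B", "C", "D", "E"),
                    start_day = c(1L, 4L, 3L, 5L, 4L),
                    end_day   = 5L)

# For each task, the earliest start_day among all higher-priority tasks
# (the rows below it); NA for the highest-priority task, which has none.
n <- nrow(jason)
next_start <- vapply(seq_len(n), function(i) {
  higher <- jason$start_day[seq_len(n) > i]
  if (length(higher) == 0L) NA_integer_ else min(higher)
}, integer(1))

# Assumed rule: a task survives only if it starts strictly before
# every higher-priority task starts.
jason$is_valid <- is.na(next_start) | jason$start_day < next_start
jason
```

Because every task ends on the same day, comparing start days is enough; no per-day expansion is needed for this check.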

Output

This is the sort of output I need (Jason is more representative of my real data, which can have over 100 tasks covering a period of 90 days):

output_df <- data.frame(person        = c(rep("Kate", 2), rep("Adam", 2), rep("Eve", 2), rep("Jason", 5)),
                        task_key   = c(c("A","B"), c("A","B"), c("A","B"), c("A","B","C","D","E")),
                        start_day     = c(c(1L,1L), c(1L,2L), c(2L,1L), c(1L,4L,3L,5L,4L)),
                        end_day       = 5L,
                        valid_from    = c( c(NA,1L), c(1L,2L), c(NA,1L), c(1L,NA,3L,NA,4L) ),
                        valid_to      = c( c(NA,5L), c(2L,5L), c(NA,5L), c(3L,NA,4L,NA,5L) ))




   person    task_key start_day end_day valid_from valid_to
1    Kate           A         1       5         NA       NA
2    Kate           B         1       5          1        5
3    Adam           A         1       5          1        2
4    Adam           B         2       5          2        5
5     Eve           A         2       5         NA       NA
6     Eve           B         1       5          1        5
7   Jason           A         1       5          1        3
8   Jason           B         4       5         NA       NA
9   Jason           C         3       5          3        4
10  Jason           D         5       5         NA       NA
11  Jason           E         4       5          4        5


Initial idea

This works, but I want a solution that uses the dbplyr package functions and is generally better than this:

library(dplyr)  # for %>%, filter() and lead()

tmp            <- input_df %>% filter(person == "Jason")
num_rows       <- nrow(tmp)
tmp$valid_from <- NA
tmp$valid_to   <- NA

for (i in seq_len(num_rows)) {
  # Current value
  current_value <- tmp$start_day[i]

  # Start days of all higher-priority tasks (the rows below the current one)
  vec <- lead(tmp$start_day, i)

  # TRUE where a higher-priority task starts on or before the current one
  test <- current_value >= vec

  # Result
  if (any(test, na.rm = TRUE) && i != num_rows) {
    tmp$valid_from[i] <- NA
    tmp$valid_to[i]   <- NA
  } else if (i != num_rows) {
    tmp$valid_from[i] <- current_value
    tmp$valid_to[i]   <- min(vec, na.rm = TRUE)
  } else {
    # The highest-priority task always runs until the final day
    tmp$valid_from[i] <- current_value
    tmp$valid_to[i]   <- max(tmp$end_day, na.rm = TRUE)
  }

}
tmp




  person    task_key start_day end_day valid_from valid_to
1  Jason           A         1       5          1        3
2  Jason           B         4       5         NA       NA
3  Jason           C         3       5          3        4
4  Jason           D         5       5         NA       NA
5  Jason           E         4       5          4        5


Follow-up question

Eventually I'll need to do this in SQL, but that seems too hard. I heard that the 'dbplyr' package could help me here: if I can solve this using dplyr functions, will it somehow convert that to a valid SQL query?
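
As a rough illustration of what that translation looks like, the sketch below builds a lazy table with dbplyr's `lazy_frame()` and a simulated Postgres connection, so no real database is needed. The single example row and the names `tasks`, `query` and `run_min` are assumptions made for this demo. `window_order()` sets the ORDER BY of the window, and `show_query()` prints the SQL that would be sent.

```r
library(dplyr)
library(dbplyr)

# A lazy table with no real database behind it; simulate_postgres() only
# selects the SQL dialect used for the translation.
tasks <- lazy_frame(
  person = "Jason", task_key = "A", start_day = 1L, end_day = 5L,
  con = simulate_postgres()
)

query <- tasks %>%
  group_by(person) %>%                 # becomes PARTITION BY
  window_order(desc(task_key)) %>%     # becomes ORDER BY inside the window
  mutate(run_min = cummin(start_day))  # running minimum as a window function

# Print the SQL that dbplyr would generate for this pipeline.
show_query(query)
```

Not every dplyr or tidyr function has a SQL translation, so it is worth checking the generated query for each step before relying on it.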

Recommended answer

A solution using the tidyverse package. map2 and unnest expand the dataset to one row per person, task and day. arrange(person, desc(task_key)) and distinct(person, Days, .keep_all = TRUE) remove duplicate days, keeping the highest-priority task for each day. After that, we can use slice to select the last remaining day of each task and derive the valid_from and valid_to dates from it.

library(tidyverse)

output_df <- input_df %>%
  # One row per person, task and day
  mutate(Days = map2(start_day, end_day, `:`)) %>%
  unnest(Days) %>%
  # For each person and day, keep only the highest-priority task
  arrange(person, desc(task_key)) %>%
  distinct(person, Days, .keep_all = TRUE) %>%
  arrange(person, task_key, Days) %>%
  # The last surviving day of each task determines where it ends
  group_by(person, task_key) %>%
  slice(n()) %>%
  mutate(end_day = ifelse(Days < end_day, Days + 1L, end_day)) %>%
  select(-Days) %>%
  rename(valid_from = start_day, valid_to = end_day) %>%
  # Bring back the tasks that were dropped entirely (they get NA)
  right_join(input_df, by = c("person", "task_key")) %>%
  select(all_of(names(input_df)), starts_with("valid")) %>%
  ungroup()
output_df
# # A tibble: 11 x 6
#    person task_key start_day end_day valid_from valid_to
#    <fct>  <fct>        <int>   <int>      <int>    <int>
#  1 Kate   A                1       5         NA       NA
#  2 Kate   B                1       5          1        5
#  3 Adam   A                1       5          1        2
#  4 Adam   B                2       5          2        5
#  5 Eve    A                2       5         NA       NA
#  6 Eve    B                1       5          1        5
#  7 Jason  A                1       5          1        3
#  8 Jason  B                4       5         NA       NA
#  9 Jason  C                3       5          3        4
# 10 Jason  D                5       5         NA       NA
# 11 Jason  E                4       5          4        5
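
Since the question asks for window functions, here is an alternative sketch that avoids expanding the data to one row per day. It is not part of the original answer and relies on two assumptions: end_day is constant per person (the simplification stated in the question), and a task is dropped only when a higher-priority task starts on or before it. The names `window_df`, `run_min`, `next_start` and `is_valid` are made up for this example.

```r
library(dplyr)

input_df <- data.frame(
  person    = c(rep("Kate", 2), rep("Adam", 2), rep("Eve", 2), rep("Jason", 5)),
  task_key  = c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D", "E"),
  start_day = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 3L, 5L, 4L),
  end_day   = 5L
)

window_df <- input_df %>%
  group_by(person) %>%
  # Walk from the highest-priority task down to the lowest.
  arrange(desc(task_key), .by_group = TRUE) %>%
  # Running minimum start_day over the current task and all higher ones.
  mutate(run_min = cummin(start_day)) %>%
  # Shift by one row: earliest start among strictly higher tasks (NA if none).
  mutate(next_start = lag(run_min)) %>%
  mutate(
    is_valid   = is.na(next_start) | start_day < next_start,
    valid_from = if_else(is_valid, start_day, NA_integer_),
    valid_to   = if_else(is_valid, coalesce(next_start, end_day), NA_integer_)
  ) %>%
  ungroup() %>%
  select(person, task_key, start_day, end_day, valid_from, valid_to) %>%
  arrange(person, task_key)  # note: persons end up in alphabetical order

window_df
```

The running minimum and the lag are kept in separate mutate() steps on purpose, because SQL does not allow one window function to be nested inside another. On a remote table, `window_order()` would probably take the place of `arrange()`, but that has not been tested here against Hive or Postgres.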
