For loop - select time window from day column


Problem description

I need to adapt some code, which works perfectly with my data frame under a different setup, so that it selects a 2-day time window from the column Day. In particular, I am interested in day 0 and the day before it (i.e. i and i - 1, where i is the day of interest): the values in the Count column for day i - 1 have to be added to the Count values for day 0 (i).

Here is an example of my data frame:

df <- read.table(text = "
        Station   Day           Count
    1    33012  12448               4
    2    35004  12448               4
    3    35008  12448               4
    4    37006  12448               4
    5    21009   4835               3
    6    24005   4835               3
    7    27001   4835               3
    8    25005  12447               3
    9    29001  12447               3
    10   29002  12447               3
    11   29002  12446               3
    12   30001  12446               3
    13   31002  12446               3
    14   47007   4834               2
    15   49002   4834               2
    16   47004  12445               1
    17   51001  12449               1
    18   51003   4832               1
    19   52004   4836               1", header = TRUE)

My output should be:

           Station    Day           Count
        1    33012  12448               7
        2    35004  12448               7
        3    35008  12448               7
        4    37006  12448               7
        5    21009   4835               5
        6    24005   4835               5
        7    27001   4835               5
        8    29002  12446               4
        9    30001  12446               4
        10   31002  12446               4
        11   51001  12449               1
        12   51003   4832               1
        13   52004   4836               1
        14   25005  12447               0
        15   29001  12447               0
        16   29002  12447               0
        17   47007   4834               0
        18   49002   4834               0
        19   47004  12445               0

I am trying this code, but it doesn't work with my real dataframe:

for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (length(temp > 0)) {
    condition1 <- df$Day == i - 1
    if (any(condition1)) {
      df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
      df$Count[condition1] <- 0
    }
  }
}

The code seems right and makes sense to me, but the output is not what I expect.

Can anyone help me?


@aichao, your code works well.

In the case that I want to consider the previous 30 days (i.e. day-30, day-29, day-28, ...., day-1, day0) is there any quick way to do it, instead of creating 30 if statements (conditions)?

Thanks again @aichao for your help.

Solution

The following does what you want on the sample data you gave:

for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (any(temp > 0)) {
    condition1 <- df$Day == i - 1
    condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
    if (any(condition1)) {
      df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
      df$Count[condition1] <- 0
    }
  }
}
print(df[order(df$Count, decreasing = TRUE),])
##   Station   Day Count
##1    33012 12448     7
##2    35004 12448     7
##3    35008 12448     7
##4    37006 12448     7
##5    21009  4835     5
##6    24005  4835     5
##7    27001  4835     5
##11   29002 12446     4
##12   30001 12446     4
##13   31002 12446     4
##17   51001 12449     1
##18   51003  4832     1
##19   52004  4836     1
##8    25005 12447     0
##9    29001 12447     0
##10   29002 12447     0
##14   47007  4834     0
##15   49002  4834     0
##16   47004 12445     0

A key requirement, gleaned from your comment and missing from your implementation, is that only days that are further down the data frame (in rows) are considered in determining the previous day and its count. That is, you are processing the data frame rows as if they were ordered in time, rather than treating the values in the Day column as the time ordering. Therefore, for df$Day == 12449 there is no previous day to consider, since all rows with df$Day == 12448 precede it. As a result, the Count for df$Day == 12449 remains at 1 and, more importantly, the Counts for all rows with df$Day == 12448 are not zeroed out after processing df$Day == 12449.
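To make the row-position argument concrete, here is a small sketch that rebuilds only the Day column of the sample data (the compact rep() construction is my own shorthand, not part of the original code) and compares row indices for the two cases discussed above.

```r
# Day column of the sample data, in row order (19 rows).
Day <- rep(c(12448, 4835, 12447, 12446, 4834, 12445, 12449, 4832, 4836),
           times = c(4, 3, 3, 3, 2, 1, 1, 1, 1))

which(Day == 12449)   # 17
which(Day == 12448)   # 1 2 3 4

# Every 12448 row sits above row 17, so none of them counts as a
# "previous day" for 12449.
any(which(Day == 12448) > max(which(Day == 12449)))   # FALSE

# The 12447 rows (8, 9, 10) sit below the last 12448 row (4), so they
# do count as the previous day for 12448.
any(which(Day == 12447) > max(which(Day == 12448)))   # TRUE
```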

To implement this, we need to further filter condition1: among the rows for which df$Day == i - 1 (previous day), we set to FALSE those that precede the last row (the one with the largest row index) for which df$Day == i (day of interest), using the line

condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE

Note that this solution assumes that identical values of the Day column are lumped together as contiguous blocks of rows, as in your sample data. Otherwise, your for loop over unique(df$Day) needs to be reconsidered completely and replaced with a loop over rows in order to track the current row for the day of interest in the data frame.
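One detail worth spelling out about the block assumption: in the filtering line, the logical index has one element per previous-day row rather than one per row of df, so R recycles it across condition1. With contiguous blocks the comparison is all TRUE, all FALSE, or empty, so the result is as intended. A possible variant that builds a full-length condition instead is sketched below; the names rows, is.today and last.today are mine, and this only makes the filter independent of recycling. It does not address the broader case of non-contiguous days.

```r
# Sample data from the question, rebuilt compactly.
times <- c(4, 3, 3, 3, 2, 1, 1, 1, 1)
df <- data.frame(
  Station = c(33012, 35004, 35008, 37006, 21009, 24005, 27001, 25005, 29001,
              29002, 29002, 30001, 31002, 47007, 49002, 47004, 51001, 51003,
              52004),
  Day   = rep(c(12448, 4835, 12447, 12446, 4834, 12445, 12449, 4832, 4836),
              times = times),
  Count = rep(c(4, 3, 3, 3, 2, 1, 1, 1, 1), times = times)
)

rows <- seq_len(nrow(df))   # row positions 1..n

for (i in unique(df$Day)) {
  is.today <- df$Day == i
  if (any(df$Count[is.today] > 0)) {
    last.today <- max(which(is.today))
    # Previous-day rows that lie strictly below the last row of day i.
    condition1 <- df$Day == i - 1 & rows > last.today
    if (any(condition1)) {
      df$Count[is.today] <- mean(df$Count[condition1]) + df$Count[is.today]
      df$Count[condition1] <- 0
    }
  }
}

print(df[order(df$Count, decreasing = TRUE), ])
```

On the sample data this gives the same counts as the loop above (7 for day 12448, 5 for 4835, 4 for 12446).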

In addition, a minor bug in your code was in the line

if(length(temp > 0)) {

The intent was to check whether there are any rows for which the Count is greater than 0 for the day of interest. However, comparison operators in R are vectorized, so temp > 0 returns a logical vector of the same length as its input temp. Therefore, length(temp > 0) will always return a positive number, which if() treats as TRUE, unless temp itself is of length 0 (i.e., empty). To get what you intend, the line is changed to

if(any(temp > 0)) {
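A quick illustration of the difference, using a made-up temp vector that stands for a day whose counts have already been zeroed out:

```r
temp <- c(0, 0, 0)   # hypothetical counts for an already-processed day

length(temp > 0)     # 3: the length of the logical vector, not a test of its values
any(temp > 0)        # FALSE: no element is greater than 0

# if() treats a non-zero number as TRUE, so the first form enters the block
# for any non-empty temp, while the second form skips it here.
if (length(temp > 0)) message("length(): block entered")
if (any(temp > 0)) message("any(): block entered")
```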

Update: new requirement regarding multiple previous days

The simplest way to address the new requirement is to put the body of code within the if (any(temp > 0)) {...} block into a function, call it accumulate.mean.count, and apply this function over a collection of previous days using sapply. The modifications are:

accumulate.mean.count <- function(this.day, lag) {
  condition1 <- df$Day == this.day - lag
  condition1[which(df$Day == this.day - lag) < max(which(df$Day == this.day))] <- FALSE
  if (any(condition1)) {
    df$Count[df$Day == this.day] <<- mean(df$Count[condition1]) + df$Count[df$Day == this.day]
    df$Count[condition1] <<- 0
  }
}

lags <- seq_len(30)

for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (any(temp > 0)) {
    sapply(lags, accumulate.mean.count, this.day=i)
  }
}

print(df[order(df$Count, decreasing = TRUE),])

Notes:

  1. lag is the number of days by which a day precedes (i.e., lags) the current day. A lag = 1 means the previous day, a lag = 2 means two days previous, and so on. lags is a collection of these. Here, lags <- seq_len(30) is a sequence from 1 to 30 over which accumulate.mean.count is applied, which is what you want. See this for an excellent overview on the *apply family of R functions. Note that lags need not be a sequence but just a collection of integers such as c(1, 5, 10) for the previous day, 5 days previous and 10 days previous. The values do not even have to be positive if you want to roll in future days, but they should not be zero.

  2. Because of the lexical scoping rule of R, setting df$Count, which is a variable outside the scope of accumulate.mean.count, within the function accumulate.mean.count requires <<- instead of <-. See this for an explanation and note the dangers of using <<- mentioned there.
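If you would rather avoid <<-, one alternative (a sketch of my own, not part of the answer above) is to pass the data frame into the function and return the modified copy. The name accumulate.mean.count.df is hypothetical, and the inner for loop over lags replaces sapply so that each call sees the result of the previous one.

```r
# Hypothetical variant of accumulate.mean.count: takes the data frame as an
# argument and returns the updated copy instead of modifying a global df.
accumulate.mean.count.df <- function(df, this.day, lag) {
  is.today <- df$Day == this.day
  condition1 <- df$Day == this.day - lag
  # Same filter as above: drop lagged-day rows that precede the last row of this.day.
  condition1[which(df$Day == this.day - lag) < max(which(is.today))] <- FALSE
  if (any(condition1)) {
    df$Count[is.today] <- mean(df$Count[condition1]) + df$Count[is.today]
    df$Count[condition1] <- 0
  }
  df
}

# Sample data from the question, rebuilt compactly.
times <- c(4, 3, 3, 3, 2, 1, 1, 1, 1)
df <- data.frame(
  Station = c(33012, 35004, 35008, 37006, 21009, 24005, 27001, 25005, 29001,
              29002, 29002, 30001, 31002, 47007, 49002, 47004, 51001, 51003,
              52004),
  Day   = rep(c(12448, 4835, 12447, 12446, 4834, 12445, 12449, 4832, 4836),
              times = times),
  Count = rep(c(4, 3, 3, 3, 2, 1, 1, 1, 1), times = times)
)

lags <- seq_len(2)

for (i in unique(df$Day)) {
  if (any(df$Count[df$Day == i] > 0)) {
    for (lag in lags) {
      df <- accumulate.mean.count.df(df, i, lag)
    }
  }
}

print(df[order(df$Count, decreasing = TRUE), ])
```

The trade-off is an explicit reassignment of df on every call, in exchange for a function that does not depend on, or silently change, a variable in the global environment.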

I do not have enough data to test lags <- seq_len(30), but for seq_len(1), I recovered the original result, and for seq_len(2), I got

##   Station   Day Count
##1    33012 12448    10
##2    35004 12448    10
##3    35008 12448    10
##4    37006 12448    10
##5    21009  4835     5
##6    24005  4835     5
##7    27001  4835     5
##16   47004 12445     1
##17   51001 12449     1
##18   51003  4832     1
##19   52004  4836     1
##8    25005 12447     0
##9    29001 12447     0
##10   29002 12447     0
##11   29002 12446     0
##12   30001 12446     0
##13   31002 12446     0
##14   47007  4834     0
##15   49002  4834     0

which I believe is what you would want.
