For loop - select time window from day column
Problem description
I need to adjust some code, which works perfectly with my data frame (but with another setup), in order to select a 2-day time window from the column Day. In particular, I am interested in the day before day 0 (i.e. i - 1 and i, where i is the day of interest): the values in the Count column for day i - 1 have to be added to the Count column for day 0 (i).
Here is an example of my data frame:
    df <- read.table(text = "   Station   Day Count
    1    33012 12448     4
    2    35004 12448     4
    3    35008 12448     4
    4    37006 12448     4
    5    21009  4835     3
    6    24005  4835     3
    7    27001  4835     3
    8    25005 12447     3
    9    29001 12447     3
    10   29002 12447     3
    11   29002 12446     3
    12   30001 12446     3
    13   31002 12446     3
    14   47007  4834     2
    15   49002  4834     2
    16   47004 12445     1
    17   51001 12449     1
    18   51003  4832     1
    19   52004  4836     1", header = TRUE)
My output should be:
       Station   Day Count
    1    33012 12448     7
    2    35004 12448     7
    3    35008 12448     7
    4    37006 12448     7
    5    21009  4835     5
    6    24005  4835     5
    7    27001  4835     5
    8    29002 12446     4
    9    30001 12446     4
    10   31002 12446     4
    11   51001 12449     1
    12   51003  4832     1
    13   52004  4836     1
    14   25005 12447     0
    15   29001 12447     0
    16   29002 12447     0
    17   47007  4834     0
    18   49002  4834     0
    19   47004 12445     0
I am trying this code, but it doesn't work with my real dataframe:
    for (i in unique(df$Day)) {
      temp <- df$Count[df$Day == i]
      if(length(temp > 0)) {
        condition1 <- df$Day == i - 1
        if (any(condition1)) {
          df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
          df$Count[condition1] <- 0
        }
      }
    }
The code seems right and makes sense, but my output is not what I expect.
Can anyone help me?
@aichao's code works well.
If I want to consider the previous 30 days (i.e. day-30, day-29, day-28, ..., day-1, day0), is there a quick way to do it, instead of creating 30 if statements (conditions)?
Thanks again @aichao for your help.
Solution
The following does what you want on the sample data you gave:
    for (i in unique(df$Day)) {
      temp <- df$Count[df$Day == i]
      if (any(temp > 0)) {
        # rows belonging to the previous day
        condition1 <- df$Day == i - 1
        # drop previous-day rows that sit above the last row of day i
        condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
        if (any(condition1)) {
          df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
          df$Count[condition1] <- 0
        }
      }
    }
    print(df[order(df$Count, decreasing = TRUE),])
    ##    Station   Day Count
    ## 1    33012 12448     7
    ## 2    35004 12448     7
    ## 3    35008 12448     7
    ## 4    37006 12448     7
    ## 5    21009  4835     5
    ## 6    24005  4835     5
    ## 7    27001  4835     5
    ## 11   29002 12446     4
    ## 12   30001 12446     4
    ## 13   31002 12446     4
    ## 17   51001 12449     1
    ## 18   51003  4832     1
    ## 19   52004  4836     1
    ## 8    25005 12447     0
    ## 9    29001 12447     0
    ## 10   29002 12447     0
    ## 14   47007  4834     0
    ## 15   49002  4834     0
    ## 16   47004 12445     0
A key requirement gleaned from your comment that was missing from your implementation is that only days that are further down the data frame (in rows) are considered in determining the previous day and its count. That is, you are processing the data frame rows as if they were ordered in time, and not treating the values in the Day column as an ordering of time. Therefore, for df$Day = 12449 there is no previous day to consider, since all rows with df$Day = 12448 precede it. As a result, the Count for df$Day = 12449 remains at 1 and, more importantly, the Counts for all rows that have df$Day = 12448 are not zeroed out after processing df$Day = 12449.
To implement this, we need to further filter condition1 so that we set to FALSE all rows with df$Day == i - 1 (previous day) that precede the last row (highest row index) for which df$Day == i (day of interest), using the line
    condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
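To see what this filter does, here is a minimal sketch on a toy data frame. The names toy, previous_day_rows and previous_day_rows_by_position are hypothetical and only serve the illustration. The second helper is an alternative formulation (an assumption, not taken from the answer) that compares row positions directly; under the block layout of the sample data both should agree.

```r
# Toy data: equal Day values form contiguous blocks, as in the sample data.
toy <- data.frame(Day   = c(12448, 12448, 12447, 12447, 12449),
                  Count = c(4, 4, 3, 3, 1))

# Hypothetical helper wrapping the filter line from the answer.
previous_day_rows <- function(day, i) {
  condition1 <- day == i - 1
  condition1[which(day == i - 1) < max(which(day == i))] <- FALSE
  condition1
}

# Alternative (an assumption, not from the answer): keep previous-day rows
# whose row position is below the last row of day i. This avoids indexing
# with a logical vector that is shorter than condition1.
previous_day_rows_by_position <- function(day, i) {
  day == i - 1 & seq_along(day) > max(which(day == i))
}

prev_for_12448 <- previous_day_rows(toy$Day, 12448)  # rows 3 and 4 qualify
prev_for_12449 <- previous_day_rows(toy$Day, 12449)  # no row qualifies
alt_for_12448  <- previous_day_rows_by_position(toy$Day, 12448)
alt_for_12449  <- previous_day_rows_by_position(toy$Day, 12449)
```

Day 12449 finds no previous-day rows because every row with Day 12448 lies above it, which is exactly the behaviour described above.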
Note that this solution assumes that equal values in the Day column are lumped together as blocks of rows, as in your sample data. Otherwise, your for loop over unique(df$Day) needs to be reconsidered completely and replaced with a loop over rows in order to track the current row for the day of interest in the data frame.
In addition, a minor bug in your code was in the line
    if(length(temp > 0)) {
The intent was to check whether there are any rows for which the Count is greater than 0 for the day of interest. However, comparison operators in R are vectorized, so temp > 0 returns a logical vector of the same length as its input temp. Therefore, length(temp > 0) will always return a positive number unless temp itself is of length 0 (i.e., empty). To get what you intend, the line is changed to
    if(any(temp > 0)) {
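The difference between the two conditions can be checked on a small vector. This is only an illustrative sketch; the variable names are made up for the example.

```r
# A day whose counts have all been zeroed out already.
temp <- c(0, 0, 0)

n_elements   <- length(temp > 0)  # 3: counts elements, whatever their values
has_positive <- any(temp > 0)     # FALSE: no count is greater than 0

# if() treats any non-zero number as TRUE, so the first test passes
# even though no count is positive.
with_length <- if (length(temp > 0)) "block runs" else "block skipped"
with_any    <- if (any(temp > 0))    "block runs" else "block skipped"

# Only an empty vector makes the length() test fail.
n_empty <- length(numeric(0) > 0)  # 0
```

Here with_length is "block runs" while with_any is "block skipped", so only the any() version skips days whose counts are all zero.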
Update: new requirement regarding multiple previous days
The simplest way to address the new requirement is to put the body of code within the if (any(temp > 0)) {...} block into a function, call it accumulate.mean.count, and apply this function over a collection of previous days using sapply. The modifications are:
    accumulate.mean.count <- function(this.day, lag) {
      # rows belonging to the day that is `lag` days before this.day
      condition1 <- df$Day == this.day - lag
      # drop lagged-day rows that sit above the last row of this.day
      condition1[which(df$Day == this.day - lag) < max(which(df$Day == this.day))] <- FALSE
      if (any(condition1)) {
        # <<- modifies the global df rather than a local copy
        df$Count[df$Day == this.day] <<- mean(df$Count[condition1]) + df$Count[df$Day == this.day]
        df$Count[condition1] <<- 0
      }
    }
    lags <- seq_len(30)
    for (i in unique(df$Day)) {
      temp <- df$Count[df$Day == i]
      if (any(temp > 0)) {
        sapply(lags, accumulate.mean.count, this.day=i)
      }
    }
    print(df[order(df$Count, decreasing = TRUE),])
Notes:
lag is the number of days previous to (i.e., that lag) the current day. A lag = 1 means the previous day, a lag = 2 means two days previous, etc. lags is a collection of these. Here, lags <- seq_len(30) is a sequence from 1 to 30 over which accumulate.mean.count is applied, which is what you want. See this for an excellent overview of the *apply family of R functions. Note that lags need not be a sequence but just a collection of integers, such as c(1, 5, 10) for the previous day, 5 days previous and 10 days previous. The values do not even have to be positive if you want to roll in future days, but they should not be zero.
Because of the lexical scoping rule of R, setting df$Count, which is a variable outside the scope of accumulate.mean.count, within the function accumulate.mean.count requires <<- instead of <-. See this for an explanation and note the dangers of using <<- mentioned there.
I do not have enough data to test lags <- seq_len(30), but for seq_len(1) I recovered the original result, and for seq_len(2) I got
    ##    Station   Day Count
    ## 1    33012 12448    10
    ## 2    35004 12448    10
    ## 3    35008 12448    10
    ## 4    37006 12448    10
    ## 5    21009  4835     5
    ## 6    24005  4835     5
    ## 7    27001  4835     5
    ## 16   47004 12445     1
    ## 17   51001 12449     1
    ## 18   51003  4832     1
    ## 19   52004  4836     1
    ## 8    25005 12447     0
    ## 9    29001 12447     0
    ## 10   29002 12447     0
    ## 11   29002 12446     0
    ## 12   30001 12446     0
    ## 13   31002 12446     0
    ## 14   47007  4834     0
    ## 15   49002  4834     0
which I believe is what you would want.
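If the global assignment with <<- is a concern, one possible variant is to pass the data frame in and return the modified copy. The sketch below is not the answer's code: the function name accumulate_counts is hypothetical, the sample data is rebuilt with data.frame() for self-containment, and the row filter is written with row positions instead of the indexing line used above. It still assumes that equal Day values form contiguous blocks.

```r
# Sample data from the question, rebuilt without read.table().
block_sizes <- c(4, 3, 3, 3, 2, 1, 1, 1, 1)
df <- data.frame(
  Station = c(33012, 35004, 35008, 37006, 21009, 24005, 27001, 25005, 29001,
              29002, 29002, 30001, 31002, 47007, 49002, 47004, 51001, 51003,
              52004),
  Day   = rep(c(12448, 4835, 12447, 12446, 4834, 12445, 12449, 4832, 4836),
              times = block_sizes),
  Count = rep(c(4, 3, 3, 3, 2, 1, 1, 1, 1), times = block_sizes)
)

# Hypothetical variant: works on its argument and returns the updated copy,
# so no <<- is needed.
accumulate_counts <- function(df, lags) {
  for (i in unique(df$Day)) {
    day_rows <- df$Day == i
    # skip days whose counts are all zero
    if (!any(df$Count[day_rows] > 0)) next
    last_row <- max(which(day_rows))
    for (lag in lags) {
      # rows of the lagged day that lie below the block of day i
      lag_rows <- df$Day == i - lag & seq_along(df$Day) > last_row
      if (any(lag_rows)) {
        df$Count[day_rows] <- mean(df$Count[lag_rows]) + df$Count[day_rows]
        df$Count[lag_rows] <- 0
      }
    }
  }
  df
}

res1 <- accumulate_counts(df, lags = seq_len(1))
res2 <- accumulate_counts(df, lags = seq_len(2))
print(res2[order(res2$Count, decreasing = TRUE), ])
```

On the sample data, res1 and res2 should reproduce the Count columns shown above for seq_len(1) and seq_len(2), while the original df is left unchanged.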