R: Selecting first of n consecutive rows above a certain threshold value
Problem description
I have a data frame with MRN, dates, and a test value.
I need to select all the first rows per MRN that have three consecutive values above 0.5.
This is an example version of the data:
       MRN Collected_Date   ANC
    1  001     2015-01-02 0.345
    2  001     2015-01-03 0.532
    3  001     2015-01-04 0.843
    4  001     2015-01-05 0.932
    5  002     2015-03-03 0.012
    6  002     2015-03-05 0.022
    7  002     2015-03-06 0.543
    8  002     2015-03-07 0.563
    9  003     2015-08-02 0.343
    10 003     2015-08-03 0.500
    11 003     2015-08-04 0.734
    12 003     2015-08-05 0.455
    13 004     2014-01-02 0.001
    14 004     2014-01-03 0.500
    15 004     2014-01-04 0.562
    16 004     2014-01-05 0.503
Example code:
    df <- data.frame(MRN = c('001','001','001','001',
                             '002','002','002','002',
                             '003','003','003','003',
                             '004','004','004','004'),
                     Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015',
                                                '03-03-2015','03-05-2015','03-06-2015','03-07-2015',
                                                '08-02-2015','08-03-2015','08-04-2015','08-05-2015',
                                                '01-02-2014','01-03-2014','01-04-2014','01-05-2014'),
                                              format = '%m-%d-%Y'),
                     ANC = as.numeric(c('0.345','0.532','0.843','0.932',
                                        '0.012','0.022','0.543','0.563',
                                        '0.343','0.500','0.734','0.455',
                                        '0.001','0.500','0.562','0.503')))
Currently, I am using a very awkward approach: I use the lag function to calculate the date difference, filter for all values >= 0.5, and then sum up the values, which helps to select the date of the THIRD value. I then subtract two days to get the date of the first value:
    df %>%
      group_by(MRN) %>%
      mutate(., days_diff = abs(Collected_Date[1] - Collected_Date)) %>%
      filter(ANC >= 0.5) %>%
      mutate(days = days_diff + lag((days_diff))) %>%
      filter(days == 5) %>%
      mutate(Collected_Date = Collected_Date - 2) %>%
      select(MRN, Collected_Date)
Output:
    Source: local data frame [2 x 2]
    Groups: MRN

      MRN Collected_Date
    1 001     2015-01-03
    2 004     2014-01-03
There must be a simpler, more elegant way. Also, my approach does not give accurate results if there are gaps between the test dates.
My desired output for this example is:
      MRN Collected_Date   ANC
    1 001     2015-01-03 0.532
    2 004     2014-01-03 0.500
So if at least three consecutive test values are >= 0.5, the date of the FIRST value should be returned.
If there are not at least three consecutive values >= 0.5, NA should be returned.
Any help is greatly appreciated!
Thank you very much!
Solution

The easiest way is to use the zoo library in conjunction with dplyr. The zoo package has a function called rollapply, which evaluates a function over a rolling window of consecutive observations.

In this example, we can use the window to calculate the minimum of the current value and the next two, and then apply the logic specified.
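    library(dplyr)
    library(zoo)

    df %>%
      group_by(MRN) %>%
      # rolling minimum of each row and the two rows after it, within each MRN
      # (note: this overwrites ANC with the rolling minimum)
      mutate(ANC = rollapply(ANC, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
      filter(ANC >= 0.5) %>%
      filter(row_number() == 1)
    #   MRN Collected_Date   ANC
    # 1 001     2015-01-03 0.532
    # 2 004     2014-01-03 0.500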
In the code above we used rollapply to calculate the minimum of each value and the two that follow it. To see how this works, compare the following:

    rollapply(1:6, width = 3, min, align = "left", fill = NA)
    # [1]  1  2  3  4 NA NA
    rollapply(1:6, width = 3, min, align = "center", fill = NA)
    # [1] NA  1  2  3  4 NA
    rollapply(1:6, width = 3, min, align = "right", fill = NA)
    # [1] NA NA  1  2  3  4
In our example the window is left-aligned, so it starts at the current row and looks ahead to the next 2 values.
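To make the left-aligned window concrete, here is a minimal base R sketch of what the rolling minimum computes. The helper name roll_min_left is made up for illustration and is not part of zoo; it only mimics the align = "left", fill = NA case shown above.

```r
# Hypothetical helper (not from zoo): minimum of x[i], x[i + 1], ..., x[i + width - 1],
# with NA where a full window does not fit at the end of the vector.
roll_min_left <- function(x, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < width) return(out)
  for (i in seq_len(n - width + 1)) {
    out[i] <- min(x[i:(i + width - 1)])
  }
  out
}

roll_min_left(1:6, 3)
# [1]  1  2  3  4 NA NA
```

A rolling minimum of at least 0.5 at row i therefore means that row i and the two rows after it are all at least 0.5, which is exactly the "three consecutive values" condition.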
Lastly, we keep the rows whose rolling minimum is at least 0.5 and take the first such row in each group.
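The pipeline above drops any MRN without a qualifying run and replaces ANC with the rolling minimum. If you need an NA row for those MRNs, as the question asks, one possible alternative is a base R sketch built on rle. This is a different technique from the rollapply answer; the helper first_run_start is hypothetical, and it assumes the rows are already sorted by date within each MRN.

```r
# Sample data from the question
df <- data.frame(
  MRN = rep(c("001", "002", "003", "004"), each = 4),
  Collected_Date = as.Date(c(
    "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05",
    "2015-03-03", "2015-03-05", "2015-03-06", "2015-03-07",
    "2015-08-02", "2015-08-03", "2015-08-04", "2015-08-05",
    "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05")),
  ANC = c(0.345, 0.532, 0.843, 0.932,
          0.012, 0.022, 0.543, 0.563,
          0.343, 0.500, 0.734, 0.455,
          0.001, 0.500, 0.562, 0.503),
  stringsAsFactors = FALSE
)

# Hypothetical helper: position of the first element that starts a run of
# at least n consecutive values >= threshold, or NA if there is no such run.
first_run_start <- function(x, threshold = 0.5, n = 3) {
  runs <- rle(x >= threshold)
  starts <- cumsum(runs$lengths) - runs$lengths + 1
  hit <- which(runs$values & runs$lengths >= n)
  if (length(hit) == 0) NA_integer_ else starts[hit[1]]
}

# For each MRN, find the row index in df where the first qualifying run starts
idx <- sapply(split(seq_len(nrow(df)), df$MRN), function(rows) {
  i <- first_run_start(df$ANC[rows])
  if (is.na(i)) NA_integer_ else rows[i]
})

# Indexing with NA yields NA, so MRNs without a run get NA date and value
result <- data.frame(MRN = names(idx),
                     Collected_Date = df$Collected_Date[idx],
                     ANC = df$ANC[idx],
                     stringsAsFactors = FALSE)
result
```

With the sample data, MRN 001 and 004 get the date and original ANC of the first value in their run (2015-01-03 and 2014-01-03), while 002 and 003 get NA. Because this counts consecutive rows rather than calendar days, gaps between test dates do not affect it.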