R: Selecting first of n consecutive rows above a certain threshold value


Problem description


I have a data frame with MRN, dates, and a test value.

For each MRN, I need to select the first row of a run of at least three consecutive values at or above 0.5.

This is an example version of the data:

   MRN Collected_Date   ANC
1  001     2015-01-02 0.345
2  001     2015-01-03 0.532
3  001     2015-01-04 0.843
4  001     2015-01-05 0.932
5  002     2015-03-03 0.012
6  002     2015-03-05 0.022
7  002     2015-03-06 0.543
8  002     2015-03-07 0.563
9  003     2015-08-02 0.343
10 003     2015-08-03 0.500
11 003     2015-08-04 0.734
12 003     2015-08-05 0.455
13 004     2014-01-02 0.001
14 004     2014-01-03 0.500
15 004     2014-01-04 0.562
16 004     2014-01-05 0.503

Example code:

df <- data.frame(MRN = c('001','001','001','001',
                         '002','002','002','002',
                         '003','003','003','003',
                         '004','004','004','004'), 
                 Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015',
                                            '03-03-2015','03-05-2015','03-06-2015','03-07-2015',
                                            '08-02-2015','08-03-2015','08-04-2015','08-05-2015',
                                            '01-02-2014','01-03-2014','01-04-2014','01-05-2014'), 
                                            format = '%m-%d-%Y'), 
                 ANC = as.numeric(c('0.345','0.532','0.843','0.932',
                         '0.012','0.022','0.543','0.563',
                         '0.343','0.500','0.734','0.455',
                         '0.001','0.500','0.562','0.503')))

Currently, I am using a very awkward approach: I use the lag function to calculate date differences, filter for all values >= 0.5, and then sum up the differences, which helps to select the date of the THIRD value. I then subtract two days to get the date of the first value:

df %>% group_by(MRN) %>%
  mutate(days_diff = abs(Collected_Date[1] - Collected_Date)) %>%
  filter(ANC >= 0.5) %>%
  mutate(days = days_diff + lag(days_diff)) %>%
  filter(days == 5) %>%
  mutate(Collected_Date = Collected_Date - 2) %>%
  select(MRN, Collected_Date)

Output:

Source: local data frame [2 x 2]
Groups: MRN

  MRN Collected_Date
1 001     2015-01-03
2 004     2014-01-03

There must be a simpler, more elegant way. Also, my approach does not give accurate results if there are gaps between the test dates.

My desired output for this example is:

   MRN Collected_Date   ANC     
1  001     2015-01-03 0.532
2  004     2014-01-03 0.500

So if at least three consecutive test values are >= 0.5, the date of the FIRST value should be returned.

If there are not at least three consecutive values >= 0.5, NA should be returned.

Any help is greatly appreciated!

Thank you very much!

Solution

The easiest way is to use the zoo library in conjunction with dplyr. The zoo package has a function called rollapply, which applies a function over a rolling window of consecutive values.

In this example, we can use the window to calculate the minimum of each value and the two that follow it, and then apply the logic specified.

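library(dplyr)
library(zoo)
df %>% group_by(MRN) %>%
  # rolling minimum goes into a helper column so the original ANC is not overwritten
  mutate(min3=rollapply(ANC, width=3, min, align="left", fill=NA, na.rm=TRUE)) %>%
  filter(min3 >= 0.5) %>%
  filter(row_number() == 1) %>%
  select(-min3)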

#   MRN Collected_Date   ANC
# 1 001     2015-01-03 0.532
# 2 004     2014-01-03 0.500


In the code above we have used rollapply to calculate the minimum over a window of 3 items. To see how this works, compare the following:

rollapply(1:6, width=3, min, align="left", fill=NA) # [1]  1  2  3  4 NA NA
rollapply(1:6, width=3, min, align="center", fill=NA) # [1] NA  1  2  3  4 NA
rollapply(1:6, width=3, min, align="right", fill=NA) # [1] NA NA  1  2  3  4

In our example the window is left-aligned, so it starts at the current row and looks ahead to the next 2 values.
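
To make the alignment concrete for this data, here is a small sketch that applies the same left-aligned rolling minimum to the ANC values of MRN 001 and MRN 003, copied from the example above. The variable names (`anc_001`, `min_001`, and so on) are only illustrative and do not appear in the original question.

```r
library(zoo)

# ANC values for two patients, copied from the example data
anc_001 <- c(0.345, 0.532, 0.843, 0.932)
anc_003 <- c(0.343, 0.500, 0.734, 0.455)

# Rolling minimum over the current value and the next two;
# windows that would run past the end are filled with NA
min_001 <- rollapply(anc_001, width = 3, FUN = min, align = "left", fill = NA)
min_003 <- rollapply(anc_003, width = 3, FUN = min, align = "left", fill = NA)

min_001  # 0.345 0.532 NA NA
min_003  # 0.343 0.455 NA NA

# Position of the first window whose minimum reaches the threshold (NA if none)
first_001 <- which(min_001 >= 0.5)[1]  # 2
first_003 <- which(min_003 >= 0.5)[1]  # NA
```

For MRN 001 the second row starts a run of three values at or above 0.5, while MRN 003 has no such run because the last value (0.455) pulls the second window's minimum below the threshold.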

Lastly, we keep the rows whose rolling minimum is at least 0.5 and take the first such row in each group.
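
Note that filtering drops any MRN without a qualifying run, whereas the question asks for NA in that case. One possible variation, sketched below, keeps one row per MRN by using summarise instead of filter. This is not part of the original answer: the helper columns `min3` and `first_idx` are made up for illustration, and the sketch assumes every MRN has at least three rows.

```r
library(dplyr)
library(zoo)

# Example data from the question, rebuilt compactly
df <- data.frame(
  MRN = rep(c("001", "002", "003", "004"), each = 4),
  Collected_Date = as.Date(c(
    "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05",
    "2015-03-03", "2015-03-05", "2015-03-06", "2015-03-07",
    "2015-08-02", "2015-08-03", "2015-08-04", "2015-08-05",
    "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05"
  )),
  ANC = c(0.345, 0.532, 0.843, 0.932,
          0.012, 0.022, 0.543, 0.563,
          0.343, 0.500, 0.734, 0.455,
          0.001, 0.500, 0.562, 0.503)
)

result <- df %>%
  group_by(MRN) %>%
  mutate(
    # rolling minimum of each value and the next two (hypothetical helper column)
    min3 = rollapply(ANC, width = 3, FUN = min, align = "left", fill = NA),
    # row index of the first qualifying window within the MRN, NA if there is none
    first_idx = which(min3 >= 0.5)[1]
  ) %>%
  summarise(
    # indexing with NA yields NA, so MRNs without a run are kept with NA values
    Collected_Date = Collected_Date[first_idx[1]],
    ANC = ANC[first_idx[1]]
  )

# MRN 001 -> 2015-01-03 (0.532); MRN 002 and 003 -> NA; MRN 004 -> 2014-01-03 (0.500)
result
```

The design choice here is to compute a per-group row index first and subset with it afterwards, which relies on R returning NA when a vector is indexed with NA.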
