How to select rows with certain values on a combination of two variables within a group in R


Problem description

This is an extension of the R problem I asked earlier: How to select rows with certain values within a group in R

I got great help on that issue, but the problem has become a bit more complicated and I hope to get advice on how to handle it.

My data looks like this:

dd <- read.table(text="
    event.timeline.ys     ID     year    group  outcome
                 1                   2     800033 2008    A  3
                 2                   1     800033 2009    A  3
                 3                   0     800033 2010    A  NA   
                 4                  -1     800033 2011    A  2  
                 5                  -2     800033 2012    A  1  
                 15                  0     800076 2008    B  2
                 16                 -1     800076 2009    B  NA
                 17                  5     800100 2014    C  4     
                 18                  4     800100 2015    C  4  
                 19                  2     800100 2017    C  4  
                 20                  1     800100 2018    C  3   
                 30                  0     800125 2008    A  2   
                 31                 -1     800125 2009    A  1   
                 32                 -2     800125 2010    A  NA
                 33                  2     800031 2008    A  3
                 34                  1     800031 2009    A  3
                 35                  0     800031 2010    A  NA   
                 36                 -1     800031 2011    A  NA  
                 37                 -2     800031 2012    A  1", header=TRUE)

I would like to select only special rows within a group (ID). These rows should be selected according to the following procedure:

If possible I would like to keep the last row with a positive value on event.timeline.ys for each participant (i.e., last row within an ID-group with event.timeline.ys >= 0) in which the outcome variable is not NA but has a valid value (e.g., for ID == 800033 this would be row 2).

Additionally, I would like to keep the first row with a negative value on event.timeline.ys for each participant (i.e., first row within an ID-group with event.timeline.ys < 0) in which the outcome variable is not NA (e.g., for ID == 800033 this would be row 4).

In the special case of ID == 800076 that does not have any non-NA values on the outcome variable when event.timeline.ys < 0, I would still like to keep the first row in which event.timeline.ys < 0.

The person with the ID = 800100 does not have any negative values on event.timeline.ys. In this case, I would like to keep only the last row with event.timeline.ys >= 0.

All other rows should be deleted. The final data frame would look like this:

      event.timeline.ys         ID     year    group  outcome
2                     1     800033     2009    A            3
4                    -1     800033     2011    A            2  
15                    0     800076     2008    B            2
16                   -1     800076     2009    B           NA
20                    1     800100     2018    C            3   
30                    0     800125     2008    A            2   
31                   -1     800125     2009    A            1
34                    1     800031     2009    A            3
37                   -2     800031     2012    A            1

I would very much appreciate advice on how to solve this problem. I already tried this:

dd %>% 
  group_by(ID) %>% 
  filter(row_number() == last(which(event.timeline.ys >= 0 & outcome >= 0)) | 
           row_number() == first(which(event.timeline.ys < 0 & outcome >= 0)))

However, I then lose row 16 (for ID == 800076), which is unfortunate.
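
Why row 16 disappears can be traced with plain vectors. The following is a minimal base R sketch, assuming only the two rows of ID == 800076; the variable names (`ys`, `neg_hits`, `first_neg`, `pos_hits`, `last_pos`, `cond`) are illustrative and not part of the original attempt.

```r
# Timeline and outcome values for ID == 800076 (rows 15 and 16 of the data)
ys      <- c(0, -1)
outcome <- c(2, NA)

# NA >= 0 evaluates to NA, and which() ignores NA,
# so no row on the negative side qualifies
neg_hits <- which(ys < 0 & outcome >= 0)

# The first element of an empty vector is NA
# (dplyr::first() also returns NA for an empty input by default)
first_neg <- neg_hits[1]

# On the non-negative side, row 15 qualifies
pos_hits <- which(ys >= 0 & outcome >= 0)
last_pos <- pos_hits[length(pos_hits)]

# Same shape as the filter() condition: TRUE for row 15, NA for row 16
cond <- seq_along(ys) == last_pos | seq_along(ys) == first_neg
print(cond)
```

Because `filter()` keeps only rows where the condition is `TRUE`, the `NA` for row 16 causes that row to be dropped.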

Many thanks in advance!

Solution

Using dplyr:

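library(dplyr)

dd %>%
  # one group per ID and per side of the timeline (>= 0 vs. < 0)
  group_by(ID, event.timeline.ys >= 0) %>%
  # within each side, put the row closest to zero first
  arrange(ID, event.timeline.ys >= 0, abs(event.timeline.ys)) %>%
  # drop NA outcomes unless the row is the only one in its group
  filter(!is.na(outcome) | n() == 1) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  # remove the helper grouping column
  select(-one_of('event.timeline.ys >= 0'))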

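Output (as posted; it contains no rows for ID 800031, so it was presumably generated before those rows were added to the example data):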

  event.timeline.ys     ID  year group outcome
              <int>  <int> <int> <fct>   <int>
1                -1 800033  2011 A           2
2                 1 800033  2009 A           3
3                -1 800076  2009 B          NA
4                 0 800076  2008 B           2
5                 1 800100  2018 C           3
6                -1 800125  2009 A           1
7                 0 800125  2008 A           2
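
One limitation of the answer above: the `n() == 1` condition keeps an all-NA group only when that group has exactly one row, so a side of the timeline with several rows that are all NA would be dropped entirely. The following is a minimal base R sketch of the rule as stated in the question, with a fallback for that case. It assumes rows are already ordered as in the example data (so "last" and "first" refer to row order), and the names `select_rows` and `pick` are hypothetical helpers, not part of the original answer.

```r
dd <- read.table(text = "
event.timeline.ys ID year group outcome
1   2 800033 2008 A 3
2   1 800033 2009 A 3
3   0 800033 2010 A NA
4  -1 800033 2011 A 2
5  -2 800033 2012 A 1
15  0 800076 2008 B 2
16 -1 800076 2009 B NA
17  5 800100 2014 C 4
18  4 800100 2015 C 4
19  2 800100 2017 C 4
20  1 800100 2018 C 3
30  0 800125 2008 A 2
31 -1 800125 2009 A 1
32 -2 800125 2010 A NA
33  2 800031 2008 A 3
34  1 800031 2009 A 3
35  0 800031 2010 A NA
36 -1 800031 2011 A NA
37 -2 800031 2012 A 1", header = TRUE)

select_rows <- function(d) {
  # Pick one row position from `rows`: prefer rows with a non-NA outcome,
  # fall back to all rows on that side if every outcome is NA
  pick <- function(rows, which_end) {
    if (length(rows) == 0) return(integer(0))
    valid <- rows[!is.na(d$outcome[rows])]
    if (length(valid) == 0) valid <- rows
    if (which_end == "last") valid[length(valid)] else valid[1]
  }
  # Work on row positions per ID, so the original row names are preserved
  keep <- unlist(lapply(split(seq_len(nrow(d)), d$ID), function(rows) {
    ys <- d$event.timeline.ys[rows]
    c(pick(rows[ys >= 0], "last"), pick(rows[ys < 0], "first"))
  }), use.names = FALSE)
  d[sort(keep), ]
}

res <- select_rows(dd)
print(res)
```

Selecting by row position and subsetting once at the end keeps the rows in their original order, which makes the result directly comparable with the desired data frame in the question.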
