R: sum consecutive duplicate rows and remove all but the first

Problem description

I am stuck on a probably simple question: how do I sum consecutive duplicate rows and remove all but the first row? Also, if there is an NA between two duplicates (such as 2, NA, 2), I want to sum those rows too and remove all but the first entry. Here is my sample data:

ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))

Printed, the data frame a looks like this:

   ia time
1   1  4.5
2   1  2.4
3   2  3.6
4  NA  1.5
5   2  1.2
6   1  4.9
7   1  6.4
8   1  4.4
9   1  4.7
10  2  7.3
11  1  2.3
12  2  4.3

Now I want to:

1.) Sum the "time" column over consecutive identical ia values, i.e. sum the times if the same number occurs twice or more in a row. In my case, rows 1 and 2 of column time are summed to 4.5 + 2.4 = 6.9.

2.) If there is an NA between two identical numbers in the ia column (i.e. ia = 2, NA, 2), also sum all of those times.

3.) Keep only the first occurrence of each run of ia and delete the rest.

In the end, I would want to have something like this:

   ia time
1   1  6.9
3   2  6.3
6   1 20.4
10  2  7.3
11  1  2.3
12  2  4.3

I found this for summing, but it sums over all rows with the same ia and ignores whether they are consecutive:

aggregate(time~ia,data=a,FUN=sum)

And I found this for removing the duplicates:

a[cumsum(rle(as.numeric(a[,1]))$lengths),]

The rle approach keeps the last entry of each run, though, and I want to keep the first. I also have no idea how to handle the NAs.

If I have a pattern of 1-NA-2, the NA should NOT be counted with either of them; in this case the NA row should be removed.

Solution

You first need to replace sequences of NAs with the values surrounding them (if those values are the same). zoo's na.locf function fills in NAs with the last observation carried forward. By testing whether a row gets the same value when you carry values forwards and when you carry them backwards, you can filter out the NAs you don't want (such as the one in 1, NA, 2), then do the carrying forward:

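library(dplyr)
library(zoo)

a %>%
  # keep a row only if filling forwards and filling backwards agree
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  # then replace the remaining NAs with the value carried forward
  mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   1  4.7
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3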
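
To see why the forward/backward comparison separates the two kinds of NA, here is a minimal sketch on a toy vector (the vector x is made up for illustration). Note that na.rm = FALSE is an addition that is not in the answer above: by default na.locf drops leading NAs, which would change the vector length if the column started or ended with NA.

```r
library(zoo)

# Toy vector: the first NA sits between 1 and 2, the second between 2 and 2
x <- c(1, NA, 2, NA, 2)

# na.rm = FALSE keeps the result the same length as x
fwd <- na.locf(x, na.rm = FALSE)                   # carry forwards:  1 1 2 2 2
bwd <- na.locf(x, fromLast = TRUE, na.rm = FALSE)  # carry backwards: 1 2 2 2 2

# The two fills disagree only for the NA in the 1-NA-2 pattern
keep <- fwd == bwd
keep
#> [1]  TRUE FALSE  TRUE  TRUE  TRUE
```

With na.rm = FALSE, a leading or trailing NA stays NA in one of the two fills, so the comparison is NA there and filter() drops that row as well.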

Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:

result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  # TRUE where ia differs from the previous row; default = FALSE acts as 0 for the first row
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  # the running count of changes gives every consecutive run its own id
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#> 
#>   group    ia  time
#>   (int) (dbl) (dbl)
#> 1     1     1   6.9
#> 2     2     2   6.3
#> 3     3     1  20.4
#> 4     4     2   7.3
#> 5     5     1   2.3
#> 6     6     2   4.3
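
The cumsum grouping step can be looked at in isolation. The following is a base R sketch of the same idea, assuming the NA has already been filled in; the names change and run_id are only illustrative.

```r
# ia from the question after the NA in row 4 has been filled with 2
ia <- c(1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2)

# TRUE wherever a value differs from the one before it;
# the first element always starts a new run
change <- c(TRUE, ia[-1] != ia[-length(ia)])

# Every TRUE bumps the counter, so all rows of one run share an id
run_id <- cumsum(change)
run_id
#> [1] 1 1 2 2 2 3 3 3 3 4 5 6
```

Grouping by this id instead of by ia alone is what keeps the two separate runs of 1 (rows 1 to 2 and rows 6 to 9) from being summed together, which is the problem with the aggregate call in the question.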

If you want to get rid of the group column, add these lines:

result %>%
  ungroup() %>%
  select(-group)
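
If you also want to keep the original row names of the first row in each run (1, 3, 6, 10, 11, 12, as in the desired output of the question), a base R variant built on rle is one option. This is only a sketch, not the method from the answer above: fill_forward is a hypothetical helper written for this example, and it stands in for na.locf so that no packages are needed.

```r
ia   <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)

# Hypothetical helper: carry the last non-NA value forward (base R only)
fill_forward <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # nothing seen yet (leading NA) stays NA
}

fwd <- fill_forward(a$ia)
bwd <- rev(fill_forward(rev(a$ia)))

# Keep rows where both fills agree; which() also drops NA comparisons
keep <- which(fwd == bwd)
b <- a[keep, ]
b$ia <- fwd[keep]

runs   <- rle(b$ia)
first  <- cumsum(c(1, head(runs$lengths, -1)))         # first row of each run
run_id <- rep(seq_along(runs$lengths), runs$lengths)   # run number of every row

# Take the first row of each run and overwrite its time with the run total
res <- b[first, ]
res$time <- as.vector(tapply(b$time, run_id, sum))
res
#>    ia time
#> 1   1  6.9
#> 3   2  6.3
#> 6   1 20.4
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3
```

The only change from the rle line in the question is the index: cumsum(lengths) points at the last row of each run, while cumsum(c(1, head(lengths, -1))) points at the first.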
