使用dplyr :: do的匿名函数 - 使用rle的结果来过滤数据 [英] Anonymous function with dplyr::do - using results from rle to filter data

查看:167
本文介绍了使用dplyr :: do的匿名函数 - 使用rle的结果来过滤数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有按主题('id')分组的时间序列数据,它们保留在某个网站上,并且在每个时间步骤中都有一定的舞台。



有时科目从一个地点切换到另一个地点,并可能再次返回。如果个人将来回(例如从网站'a'切换到网站'b',然后返回网站'a'),如果只有中间站点的一次注册(在转换aba中,那么站点'b'在这里被认为是一个中间站点)个人处于某个阶段(在这里,stage = 2)在中间站点,然后我希望在此时间步骤删除注册。



我的虚拟数据由四个科目组成。其中三个(主题1-3)已从站点a移动到b,然后返回站点b,一个已从a移动到b。



第一个两个科目在中间站点都有单一注册。主题1在中间站点的第一阶段,我希望保留该注册。另一方面,标题2在中间站点的第二阶段,这个注册应该被删除。主题3,也在a和b之间来回移动。然而,虽然在中间站点b处于第二阶段,但是在那里注册了两个,并且保留了两个注册。主题4已经从站点a移动到b,但是不能再次返回。因此,虽然在现场b的第二阶段,现场登记b不是中间站点,应该保留。



数据:

  df<  -  structure(list(id = c(1,1,1,2,2,2,3,3,3 ,3L,3L,1L,2L,3L,4L,1L,2L),
time = c(1L,2L,3L,1L,2L,3L,1L,2L,3L, a,b,a,a,b,a,a,b,b,a,a,b $ b stage = c(1,1,1,2,1,1,2,2,1,1,2)),
.Names = c(id,time,网站,舞台),
row.names = c(NA,-12L),class =data.frame)

df

# id时间站点阶段
#1 1 1 a 1
#2 1 2 b 1 <~~在现场2的单个中间注册
#3 1 3 a 1但是,个人是在阶段1: - >保持

#4 2 1 a 1
#5 2 2 b 2 <~~在第2阶段的现场2的单中间注册: - >删除
#6 2 3 a 1

#7 3 1 a 1
#8 3 2 b 2< ~~两个中间注册阶段2: - >保持两行
#9 3 3 b 2<〜〜
#10 3 4 a 1

#11 4 1 a 1
#12 4 2 b 2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~保持

因此,在测试数据中,只有在时间= 2时注册id = 2应该删除。



以前,我已经使用 plyr :: ddply 并从 rle 解决问题:



对于每个人,计算站点的运行长度( rle(x $ site) 如果

- 站点之间来回(例如从a到b,然后返回到a)
length(r $ values)> 2 &

- 中间站点只有一个注册( r $ length [2] == 1 &

- 中间站点的阶段是2( x $ stage [x $ site == r $ values [2]] [1] == 2

然后:删除中间注册网站 x [!(x $ site == r $ values [2]),]

  library(plyr)

ddply(df,。(id),function(x){
r< - rle(x $ site)
if(length(r $ values)> 2& r $ length [2] == 1& x $ stage [x $ site == r $ values [2]] [1] == 2){
x [x $ site!= r $ values [2],]
} else x
})

#id时间现场舞台
#1 1 1 a 1
#2 1 2 b 1
#3 1 3 a 1

#4 2 1 a 1 <~~在时间2的阶段= 2的单个中间站点被删除
#5 2 3 a 1<〜〜

#6 3 1 a 1
#7 3 2 b 2
#8 3 3 b 2
#9 3 4 a 1

#10 4 1 a 1
#11 4 2 b 2

detach(package:plyr)

现在我有一些麻烦, code> dplyr 。我在SO上找到了一些相关的帖子(例如 )和github( 和< a href =https://github.com/hadley/dplyr/issues/418 =nofollow noreferrer> 这个 ),但我有麻烦适应我们需要。以下是一些绝望的尝试:

 库(dplyr)

df%>%
group_by(id)%>%
do((function(x){
r = rle(x $ site)
if(length(r $ values)> 2& r $ length [2] == 1& df $ stage [df $ site == r $ values [2]] [1] == 2){
filter(x,x $ site!= r $ values [2])
} else x
})(。))
#所需行不被删除

df%>%
group_by id)%>%
do(function(x){
r = rle(x $ site)
if(length(r $ values)> 2& r $ length [2 ] == 1& df $ stage [df $ site == r $ values [2]] [1] == 2){
x [!(x $ site == r $ values [2]), ]
} else x
})
#错误:结果不是位置上的数据框:1,2,3

这种尝试发生了工作(与 ddply 相同的结果),但是远远优雅,我怀疑这是正确的方式:

  df%>%
group_by(id)%>%
do(r = rle(。$ site))%& %
do(data.frame(id =。$ id,
len = length(。$ r $ values),
site =。$ r $ values [2],
len2 =。$ r $ length [2]))%>%
filter(len == 3,len2 == 1)%>%
select(-len)%> ;%
left_join(df,。,by = c(id,site))%>%
过滤器(!(len2%in%1& stage == 2))%>%
select(-len2)

do 这样吗? WWHWD?

解决方案

我不知道我是否完全了解你的代码背后的逻辑,但这可能是另一种获取方式如果需要,可能需要进行一些修改:

  df%>%
group_by(id)% >%
group_by(grp = cumsum(abs(c,1,diff(as.numeric(site)))),add = TRUE)%>%
filter(!(grp = = 2& n()== 1& stage == 2))

#Source:本地数据框[9 x 5]
#Groups:id,grp

#id时间站点阶段grp
#1 1 1 a 1 1
#2 1 2 b 1 2
#3 1 3 a 1 3
# 4 2 1 a 1 1 <〜行之间
#5 2 3 a 1 3< ~~被删除
#6 3 1 a 1 1
#7 3 2 b 2 2
#8 3 3 b 2 2
#9 3 4 a 1 3

这种方法假设中间组总是第二个grp。






创建一个函数可能会更好 - 由于它在分组数据中创建组,所以我将要调用 intergroup()并使用它:

  intergroup<  -  function(var,start = 1){
cumsum(abs (start,diff(as.numeric(as.factor(var))))))
}

df%>%
group_by(id)%>%
group_by(grp = intergroup(site),add = TRUE)%>%
filter(!(grp == 2& n()== 1& stage == 2))






OP问题后编辑



尝试调整后的问题的以下调整代码:

  df%>%
group_by(id)%>%
mutate(z = lag(site,1)!= lead(site,1))%>%#和以后不一样
group_by(grp = intergroup(site),add = TRUE)%>%
filter(!(grp == 2& n()== 1& stage == 2&!is.na(z)))%>%#检查z
中的NA ungroup()%>%select(-c(z,grp))

#资源:本地数据框[11 x 4]

#id时间站点阶段
#1 1 1 a 1
#2 1 2 b 1
#3 1 3 a 1
#4 2 1 a 1
#5 2 3 a 1
#6 3 1 a 1
#7 3 2 b 2
#8 3 3 b 2
#9 3 4 a 1
#10 4 1 a 1
#11 4 2 b 2<〜〜行被保留


I have time series data grouped by subject ('id'), which stay on certain 'site' and have a certain 'stage' in each 'time' step.

Sometimes subjects switch from one site to another, and possibly back again. If individuals switch site back and forth (e.g. from site 'a' to site 'b', and then back to site 'a') and if there is only one registration on the middle site (in a transition a-b-a, then site 'b' would here be considered a 'middle site') and the individual is in a certain stage (here, stage = 2) at the middle site, then I wish to remove the registration at this time step.

My dummy data consists of four subjects. Three of them (subject 1-3) have moved from site a to b, and then back to site b, and one has moved from a to b.

The first two subjects both have a single registration on the middle site. Subject 1 is in stage 1 on the middle site and I wish to keep that registration. Subject 2 on the other hand is in stage 2 on the middle site and this registration should be removed. Subject 3, has also moved back and forth between a and b. However, although it is in stage 2 on the middle site b, it has two registrations there and both registrations are kept. Subject 4 has moved from site a to b, but not back again. Thus, although it is in stage 2 on site b, the registration on site b is not a 'middle site' and should be kept.

The data:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                     time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
                     site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
                     stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
                     .Names = c("id", "time", "site", "stage"),
                row.names = c(NA, -12L), class = "data.frame")

df

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1 <~~ A single middle registration on site 2
# 3   1    3    a     1     However, the individual is in stage 1: -> keep 

# 4   2    1    a     1
# 5   2    2    b     2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6   2    3    a     1

# 7   3    1    a     1
# 8   3    2    b     2 <~~ Two middle registrations with stage 2: -> keep both rows 
# 9   3    3    b     2 <~~
# 10  3    4    a     1

# 11  4    1    a     1 
# 12  4    2    b     2 <~~ A single registration on site 2 with stage 2,
#                            but it is not in between two sites: -> keep

Thus, in the test data, it is only the registration at time = 2 for id = 2 which should be removed.

Previously, I have used plyr::ddply and result from rle to solve the problem:

For each individual, calculate run lengths of site (rle(x$site))
If:
- back and forth between sites (e.g. from a to b, and back to a) (length(r$values) > 2) &
- only one registration on middle site (r$lengths[2] == 1) &
- stage on middle site is 2 (x$stage[x$site == r$values[2]][1] == 2)
Then: remove registration on middle site x[!(x$site == r$values[2]), ])

library(plyr)

ddply(df, .(id), function(x){
  r <- rle(x$site)
  if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
    x[x$site != r$values[2], ]
  } else x
})

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1

# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5   2    3    a     1 <~~

# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1

# 10  4    1    a     1
# 11  4    2    b     2

detach("package:plyr")

Now I have some trouble getting this right in dplyr. I found some relevant posts on SO (e.g. this and this), and on github (this and this), but I have trouble to adapt them to my needs. Here are some desperate attempts:

library(dplyr)

df %>%
  group_by(id) %>%
  do((function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
    filter(x, x$site != r$values[2])
  } else x
})(.))
# desired row is not removed

df %>%
  group_by(id) %>%
  do(function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      x[!(x$site == r$values[2]), ]
    } else x
  })
# Error: Results are not data frames at positions: 1, 2, 3

This attempt happens to work (gives same result as ddply above), but is very far from elegant, and I doubt it's 'the right way':

df %>%
  group_by(id) %>%
  do(r = rle(.$site)) %>%  
  do(data.frame(id = .$id,
                len = length(.$r$values),
                site = .$r$values[2],
                len2 = .$r$lengths[2])) %>%
  filter(len == 3, len2 == 1) %>%
  select(-len) %>%
  left_join(df, ., by = c("id", "site")) %>%
  filter(!(len2 %in% 1 & stage == 2)) %>%
  select(-len2)

How to do this properly? WWHWD?

解决方案

I'm not sure whether I fully understood the logic behind your code, but this might be another way to get to the same result, perhaps with some modifications if needed:

df %>% 
  group_by(id) %>%
  group_by(grp = cumsum(abs(c(1, diff(as.numeric(site))))), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

#Source: local data frame [9 x 5]
#Groups: id, grp
#
#  id time site stage grp
#1  1    1    a     1   1
#2  1    2    b     1   2
#3  1    3    a     1   3
#4  2    1    a     1   1     <~~ row in between 
#5  2    3    a     1   3     <~~ was removed
#6  3    1    a     1   1
#7  3    2    b     2   2
#8  3    3    b     2   2
#9  3    4    a     1   3

This approach assumes that the "middle group" is always the second "grp".


Might be even nicer to create a function - which I'm gonna call intergroup() since it creates groups inside the grouped data, and use that:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

df %>% 
  group_by(id) %>%
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))


Edit after OP question update.

Try the following adjusted code for the adjusted problem:

df %>% 
  group_by(id) %>%
  mutate(z = lag(site, 1) != lead(site, 1)) %>%   # check if site before and after are not the same
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2 & !is.na(z))) %>%  # check for NA in z
  ungroup() %>% select(-c(z, grp))  

#Source: local data frame [11 x 4]
#
#   id time site stage
#1   1    1    a     1
#2   1    2    b     1
#3   1    3    a     1
#4   2    1    a     1
#5   2    3    a     1
#6   3    1    a     1
#7   3    2    b     2
#8   3    3    b     2
#9   3    4    a     1
#10  4    1    a     1
#11  4    2    b     2    <~~ row is kept

这篇关于使用dplyr :: do的匿名函数 - 使用rle的结果来过滤数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆