Replace column value in a data frame based on other columns

Question

I have the following data frame, ordered by name and time.

set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)), 
                 'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
                 'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2))
                 )
> df
   name time score
1     x    1     0
2     x    1     4
3     x    2     3
4     x    2     5
5     x    2     0
6     x    3     1
7     y    1     5
8     y    2     0
9     y    3     5
10    y    4     8

In df$score there are zeros followed by an unknown number of non-zero values, e.g. df[1:4,]. Sometimes df$name changes between two rows with df$score == 0, e.g. df[6:7,].

I want to change df$time where df$score != 0. Specifically, I want to assign the time value of the closest preceding row with df$score == 0, provided that row has the same df$name.

The following code gives the correct output, but my data has millions of rows, so this solution is very inefficient.

score_0 <- append(which(df$score == 0), dim(df)[1] + 1)

for(i in 1:(length(score_0) - 1)) {
  df$time[score_0[i]:(score_0[i + 1] - 1)] <-
    ifelse(df$name[score_0[i]:(score_0[i + 1] - 1)] == df$name[score_0[i]], 
           df$time[score_0[i]], 
           df$time[score_0[i]:(score_0[i + 1] - 1)])
}

> df
   name time score
1     x    1     0
2     x    1     4
3     x    1     3
4     x    1     5
5     x    2     0
6     x    2     1
7     y    1     5
8     y    2     0
9     y    2     5
10    y    2     8

Here score_0 holds the indices where df$score == 0. We see that df$time[2:4] are now all equal to 1. In df$time[6:7] only the first value changed, because the second row has df$name == 'y' while the closest preceding row with df$score == 0 has df$name == 'x'. The last two rows have also been changed correctly.

Recommended answer

You can do it like this:

library(dplyr)
df %>% group_by(name) %>% mutate(ID=cumsum(score==0)) %>% 
       group_by(name,ID) %>% mutate(time = head(time,1)) %>% 
       ungroup() %>%  select(name,time,score) %>% as.data.frame()

#    name time score
# 1     x    1     0
# 2     x    1     8
# 3     x    1    10
# 4     x    1     6
# 5     x    2     0
# 6     x    2     5
# 7     y    1     4
# 8     y    2     0
# 9     y    2     5
# 10    y    2     9
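
To see why this works: cumsum(score == 0) within each name gives every zero row and the rows that follow it a shared run number, and the first time in each (name, run) group is the time of that zero row. Below is a minimal base-R sketch of the same idea, not the answerer's code. It assumes the data are ordered as in the question. The data frame is typed out by hand from the question's printed output, and run_id and first_time are names introduced here for illustration. Unlike the dplyr pipeline, it leaves rows that come before the first zero of a name (run 0) unchanged, which is what the loop in the question does.

```r
# Data typed out by hand to match the printed example in the question,
# so the result does not depend on the random number generator.
df <- data.frame(
  name  = c(rep("x", 6), rep("y", 4)),
  time  = c(1, 1, 2, 2, 2, 3, 1, 2, 3, 4),
  score = c(0, 4, 3, 5, 0, 1, 5, 0, 5, 8)
)

# Within each name, count how many zero scores have been seen so far.
# Rows before the first zero of a name get run id 0.
run_id <- ave(as.integer(df$score == 0), df$name, FUN = cumsum)

# For each (name, run) group, repeat the time of the group's first row.
first_time <- ave(df$time, df$name, run_id,
                  FUN = function(t) rep(t[1], length(t)))

# Only overwrite rows that actually follow a zero of the same name.
df$time <- ifelse(run_id > 0, first_time, df$time)

print(df)
```

On the example data this gives time = 1 1 1 1 2 2 1 2 2 2, the same as the loop in the question. Whether it is fast enough for millions of rows would need to be measured on the real data.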
