Replace column value in a data frame based on other columns
Question
I have the following data frame ordered by name and time.
set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)),
'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2))
)
> df
name time score
1 x 1 0
2 x 1 4
3 x 2 3
4 x 2 5
5 x 2 0
6 x 3 1
7 y 1 5
8 y 2 0
9 y 3 5
10 y 4 8
In df$score there are zeros followed by an unknown number of actual values (e.g. df[1:4,]), and sometimes the run between two df$score == 0 rows spans more than one df$name (e.g. df[6:7,]).
I want to change df$time where df$score != 0. Specifically, I want to assign the time value of the closest preceding row with df$score == 0, provided df$name matches.
The following code gives the correct output, but my data has millions of rows, so this solution is very inefficient.
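# Indices of the rows where score == 0, plus a sentinel one past the last row
score_0 <- append(which(df$score == 0), nrow(df) + 1)
for (i in seq_len(length(score_0) - 1)) {
  # Rows from this zero up to (not including) the next zero
  rows <- score_0[i]:(score_0[i + 1] - 1)
  # Copy the zero row's time only to rows that share its name
  df$time[rows] <- ifelse(df$name[rows] == df$name[score_0[i]],
                          df$time[score_0[i]],
                          df$time[rows])
}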
> df
name time score
1 x 1 0
2 x 1 4
3 x 1 3
4 x 1 5
5 x 2 0
6 x 2 1
7 y 1 5
8 y 2 0
9 y 2 5
10 y 2 8
Here score_0 holds the indices where df$score == 0. We see that df$time[2:4] are now all equal to 1, and that in df$time[6:7] only the first one changed, because the second has df$name == 'y' while the closest preceding row with df$score == 0 has df$name == 'x'. The last two rows have also changed correctly.
Recommended answer
You can do it like this:
library(dplyr)
df %>% group_by(name) %>% mutate(ID=cumsum(score==0)) %>%
group_by(name,ID) %>% mutate(time = head(time,1)) %>%
ungroup() %>% select(name,time,score) %>% as.data.frame()
# name time score
# 1 x 1 0
# 2 x 1 8
# 3 x 1 10
# 4 x 1 6
# 5 x 2 0
# 6 x 2 5
# 7 y 1 4
# 8 y 2 0
# 9 y 2 5
# 10 y 2 9
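The pipeline above works by numbering the runs inside each name with cumsum(score == 0), then giving every row in a run the time of the run's first row. For readers who want the same idea without dplyr, here is a minimal base R sketch. The function name fill_time_from_zero_rows is hypothetical, and the score values are copied from the question's printed output rather than re-sampled, so the example does not depend on the random number generator. Like the answer, it assumes the rows are already ordered by name and time.

```r
# Hypothetical helper (not from the original answer): base R version of the
# "cumsum within name, then take the first time per run" idea.
fill_time_from_zero_rows <- function(df) {
  # Run counter per name: increases by one at every score == 0 row.
  # Rows before the first zero of a name get run 0 and keep their own group.
  run_id <- ave(as.integer(df$score == 0), df$name, FUN = cumsum)
  # One key per (name, run) pair, so no empty group combinations arise.
  key <- paste(df$name, run_id)
  # Every row takes the time of the first row in its (name, run) group.
  df$time <- ave(df$time, key, FUN = function(t) t[1])
  df
}

# Data from the question, with scores taken from its printed output.
df <- data.frame(
  name  = c(rep("x", 6), rep("y", 4)),
  time  = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
  score = c(0, 4, 3, 5, 0, 1, 5, 0, 5, 8)
)

result <- fill_time_from_zero_rows(df)
print(result)
```

Row 7 (name y, before any zero in y) lands in run 0 of its own name, so it keeps its original time, which matches the behaviour the question asks for. Both this sketch and the dplyr version avoid the explicit loop over zero positions; which one is faster on millions of rows would need to be measured.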