Return column index of first set of consecutive values in data frame row in R


Question

I have a large data frame (200k rows) consisting of monthly trial data. Each variable records the result of the trial in that month: positive (1) or negative (0). The file also contains unique ids and a number of factor variables for use in analysis. Here is a simplified example for illustration:

w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w,x,y,z), row.names = NULL)
names(dfrm) <- c("id","jan","feb","mar","apr","may","jun","start")

The trial participants all joined at different times; the final column is an index giving the column in which that participant's first trial result is recorded. Results for months prior to the participant joining are recorded as zeros (as in the first row of the example).

I want to identify the first sequence of three consecutive zeros per participant, and then return the position of the start of that 3-zero sequence; but limiting my search only to the columns since they started the trial (those from the index column onwards).

My approach - and I'm sure there are many - has been to split this into two tasks: writing NAs to those test results that occurred before the participant joined, using a for loop:

for (i in 1:nrow(dfrm)) {
  if (dfrm$start[i] > 2)
    dfrm[i, 2:(dfrm$start[i] - 1)] <- NA
}

before using a match loop on the full range of data now that the rogue early zeros have been set to NA:

for (i in 1:nrow(dfrm)) {
  f <- match(c(0, 0, 0), dfrm[i, 2:7])
  dfrm$outputmth[i] <- f[1]
}

dfrm$outputmth <- dfrm$outputmth - (dfrm$start - 2)

Which is successful (I think) in generating my desired output: the first occurrence of 3 successive zeros per participant when active, and NA where no occurrence was found.

This involved some clunky workarounds; in particular, the second loop returns a vector of 3 values in f, from which I have to select only the first item to populate dfrm$outputmth. But more importantly, running this code on the full data set has taken around 30 minutes to execute. So, feeling a little embarrassed, I'm hoping there is at least one more efficient way to write and run this?

Many thanks for your assistance.

Recommended answer

I don't think what you have written gives the correct result, because match(c(0, 0, 0), ...) does not match the first three consecutive zeros; it returns the position of the first zero, repeated three times. In general, try to avoid for loops that iterate over the rows of a data frame, because they tend to be slow (for example, altering the contents of the data frame in the body of the loop causes copies to be created). A workaround is to use apply to go over the rows of the data frame and the function rle to check whether there are three consecutive zeros:

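dfrm$outputmth <- apply(dfrm[-1], 1, function(x) {
    # x holds jan..jun in positions 1:6 and start in position 7 (id is dropped)
    y <- rle(x[x[7]:6])                     # runs of equal values from position start onwards
    z <- y$values == 0 & y$lengths >= 3     # runs of at least three zeros
    i <- which(z)[1]                        # first such run, NA if there is none
    if (is.na(i)) return(NA)
    if (i == 1) return(x[7])                # the run begins at the start position itself
    return(sum(y$lengths[1:(i-1)]) + x[7])  # skip the lengths of the earlier runs
})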

dfrm
#  id jan feb mar apr may jun start outputmth
# 101   0   0   0   1   1   1     5        NA
# 102   0   0   0   0   0   0     3         3
# 103   1   0   0   0   0   0     2         2
# 104   1   1   1   0   0   0     2         4
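
To make the difference between match and rle concrete, here is a small sketch on a made-up vector (v, r, i and run_start are illustrative names, not part of the original code). It shows that match looks up each 0 separately, while rle exposes the run lengths needed to locate the first run of three zeros.

```r
# Hypothetical example vector: a lone zero, then a run of three zeros
v <- c(0, 1, 0, 0, 0)

# match() looks up each element of c(0, 0, 0) independently,
# so it reports the position of the first zero three times
m <- match(c(0, 0, 0), v)

# rle() summarises the vector as runs of equal values
r <- rle(v)   # r$values: 0 1 0, r$lengths: 1 1 3

# index of the first run that is made of zeros and has length >= 3
i <- which(r$values == 0 & r$lengths >= 3)[1]

# position in v where that run begins: total length of earlier runs, plus one
run_start <- sum(r$lengths[seq_len(i - 1)]) + 1
```

Here m is 1 1 1 even though the three consecutive zeros begin at position 3, which is what run_start recovers.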

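Since the full data set has 200k rows, a fully vectorised variant may also be worth trying. The following is only a sketch of a different technique (comparing shifted column slices of a matrix rather than calling rle per row), not the answer's method. It assumes the same indexing convention as the apply code above, that is, start is used as a position among the month columns, and the names m, n, zero_run and first are introduced here for illustration.

```r
# Rebuild the example data from the question
w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w, x, y, z), row.names = NULL)
names(dfrm) <- c("id", "jan", "feb", "mar", "apr", "may", "jun", "start")

m <- as.matrix(dfrm[2:7])   # month columns only
n <- ncol(m)

# zero_run[i, j] is TRUE when months j, j + 1 and j + 2 of row i are all zero
zero_run <- m[, 1:(n - 2)] == 0 & m[, 2:(n - 1)] == 0 & m[, 3:n] == 0

# discard windows that begin before the row's start position
# (dfrm$start is recycled down the columns, so row i is compared with start[i])
zero_run[col(zero_run) < dfrm$start] <- FALSE

# first qualifying window per row, or NA when the row has none
first <- max.col(zero_run * 1, ties.method = "first")
first[rowSums(zero_run) == 0] <- NA

dfrm$outputmth <- first
```

On the four example rows this gives NA, 3, 2, 4, the same values as the apply version. The window length of three is hard-coded in the zero_run line; a different run length would need that expression adjusted.
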