Alternatives to a for loop with indexing - R
Question
I am converting unstructured data into a long format and need to create an ID (grouping) variable. I want to assign an ID variable based on sets of values contained in another variable. More specifically, consider the following data set.
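set.seed(1234)
x.1 <- rep(letters[1:5], 10)             # repeating a..e pattern marks the cases
x.2 <- sample(0:10, 50, replace = TRUE)  # arbitrary values
x.3 <- rep(NA, 50)                       # placeholder for the ID variable
df <- data.frame(x.1, x.2, x.3)
df <- df[-c(2, 19), ]                    # drop two rows so cases differ in length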
A unique case can be identified from the x.1 variable: it starts with a and ends with e. This is always the case. x.3 will hold the ID (grouping) variable.
> head(df, 9)
x.1 x.2 x.3
a 1 NA
c 6 NA
d 6 NA
e 9 NA
a 7 NA
b 0 NA
c 2 NA
d 7 NA
e 5 NA
The number of records between a and e for a given case can vary considerably (in the real data file). Thus, I cannot assign a unique ID by simply dividing the variable into blocks of a fixed number of records. I figured out how to make the proper assignment by using a for loop:
START <- which(df$x.1 == "a")
END <- which(df$x.1 == "e")
for (i in seq_along(START)) {
  df$x.3[START[i]:END[i]] <- i
}
head(df, 9)
x.1 x.2 x.3
a 1 1
c 6 1
d 6 1
e 9 1
a 7 2
b 0 2
c 2 2
d 7 2
e 5 2
The obvious problem with this approach is that it is much too slow for a data set with over one million records. It seems that lapply could be an alternative, but I can't figure out how to specify when a case ends and a new one begins as it traverses the data file. And feel free to point me to an existing answer if one exists. I didn't find one!
Thanks in advance.
Recommended answer
If there are no gaps between groups, i.e. each "e" is immediately followed by the "a" of the next group, you can simply use cumsum:
df$x.3 <- cumsum(df$x.1 == "a")
df
# x.1 x.2 x.3
#1 a 1 1
#3 c 6 1
#4 d 6 1
#5 e 9 1
#6 a 7 2
#7 b 0 2
#8 c 2 2
#9 d 7 2
#10 e 5 2
#11 a 7 3
#12 b 5 3
#13 c 3 3
#...
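As a quick sanity check, the sketch below compares the IDs from the loop in the question with the IDs from cumsum on the same x.1 pattern. The names loop_id and cumsum_id are introduced here for illustration only, and x.2 is left out because it plays no role in the grouping.

```r
# Rebuild the grouping column from the question (rows 2 and 19 dropped).
x.1 <- rep(letters[1:5], 10)[-c(2, 19)]

# IDs assigned by the question's loop.
START <- which(x.1 == "a")
END <- which(x.1 == "e")
loop_id <- rep(NA_integer_, length(x.1))
for (i in seq_along(START)) {
  loop_id[START[i]:END[i]] <- i
}

# IDs assigned by the vectorized approach.
cumsum_id <- cumsum(x.1 == "a")

# Stops with an error if the two results differ.
stopifnot(isTRUE(all.equal(loop_id, cumsum_id)))
```

The loop touches the data once per case, whereas cumsum makes a single pass over the whole column, which is why it scales better to millions of rows.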
And if your data was enormously large you could use data.table to update the data by reference:
library(data.table)
setDT(df)[, x.3 := cumsum(x.1 == "a")]
As correctly noted by @nicola in the comments, this assumes that "a" only appears at the beginning of a group, not in the middle of one. Based on the sample data, this seems like a valid assumption.
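If the real data did contain stray rows between an "e" and the next "a", one possible variation (not part of the original answer) is to also count the "e" rows and leave rows outside any a..e run as NA. This is only a sketch on a made-up vector x; the names starts, ends_before and id are hypothetical, and it still assumes that "a" never appears in the middle of a group.

```r
# Made-up example: "z" sits in a gap between two groups.
x <- c("a", "b", "e", "z", "a", "c", "e")

starts <- cumsum(x == "a")                     # groups opened so far
ends_before <- c(0, head(cumsum(x == "e"), -1))  # groups closed before this row

# A row belongs to a group only if a group is currently open.
id <- ifelse(starts > ends_before, starts, NA)
id
```

Here the "z" row gets NA, while the rows of the two groups get 1 and 2.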
How it works:
Let's take a subset of column "x.1":
x <- df$x.1[1:15]
x
# [1] a c d e a b c d e a b c d e a
#Levels: a b c d e
You can now check if x is equal to "a" which will create a logical vector:
x == "a"
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
Now, what cumsum does: it cumulatively adds up all the TRUE values (which are essentially 1s):
cumsum(x == "a")
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4
So you can treat a logical vector like a numeric vector of 1s and 0s and do arithmetic with it.
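A few more illustrations of the same coercion, using a small made-up vector (flags is a name chosen here for the example):

```r
flags <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

n_true <- sum(flags)      # counts the TRUE values
share <- mean(flags)      # proportion of TRUE values
running <- cumsum(flags)  # running count of TRUE values

n_true
share
running
```

In each case TRUE is treated as 1 and FALSE as 0 before the arithmetic is done.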