How can I rank observations in-group faster?
Question
I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how often the person has been observed already.
I have the first two columns and want the third:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]
person.obs$n.obs = 0
# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
  print(unp[i])
  person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
    1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
  i=i+1
  gc()
}
# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) {
    pnr = person.obs[i,]$PERSNR
    e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}
Recommended answer
This can be done with the data.table and dplyr packages.
data.table:
library(data.table)
setDT(foo)[, rn := 1:.N, by = person] # setDT(foo) is needed to convert to a data.table
Or with the new rowid function (v1.9.7+, at the time of writing only available in the development version):
setDT(foo)[, rn := rowid(person)]
Both give:
> foo
person year rn
1: pers1 1999 1
2: pers1 2000 2
3: pers1 2003 3
4: pers2 1998 1
5: pers2 2011 2
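For a reproducible version of the data.table approach, here is a minimal sketch. The answer does not show how foo was created, so the table below is rebuilt by hand from the printed output (an assumption), with the rows deliberately shuffled. The column name rn2 is only illustrative. Sorting with setorder first makes the counter follow the year of observation, which is what the question's own order() call does.

```r
library(data.table)

# foo is rebuilt by hand from the printed output (an assumption);
# the rows are deliberately out of order
foo <- data.table(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(2000, 1999, 2003, 2011, 1998)
)

# sort by person and year so the counter follows chronological order
setorder(foo, person, year)

# running counter within each person
foo[, rn := seq_len(.N), by = person]

# the same counter via rowid() (v1.9.7+ according to the answer)
foo[, rn2 := rowid(person)]

print(foo)
```

Both columns should hold the same counter. Because := modifies foo by reference, no reassignment is needed.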
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
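The difference between a running counter and a rank only shows up when there are ties. The sketch below uses a hypothetical table bar (not from the original post) in which pers1 is observed twice in 2000. The column name rk is illustrative.

```r
library(data.table)

# hypothetical data: pers1 has two observations in the year 2000
bar <- data.table(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 2000, 2000, 1998, 2011)
)

# running counter: tied years still get distinct numbers
bar[, rn := seq_len(.N), by = person]

# dense rank: tied years share a rank and no rank is skipped
bar[, rk := frank(year, ties.method = "dense"), by = person]

print(bar)
```

For pers1 the counter runs 1, 2, 3 while the dense rank is 1, 2, 2. Which one is wanted depends on whether repeated years should count as separate observations.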
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
Both give a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]
person year rn
(fctr) (dbl) (int)
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
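If neither package is available, the same counter can probably be built in base R as well. This is not part of the original answer. It is a minimal sketch that uses ave() with seq_along(), and foo is again rebuilt by hand from the printed output (an assumption).

```r
# foo is rebuilt by hand from the printed output (an assumption)
foo <- data.frame(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 2000, 2003, 1998, 2011)
)

# make sure the rows are in person/year order before counting
foo <- foo[order(foo$person, foo$year), ]

# ave() applies seq_along() within each person and puts the
# results back in the original row positions
foo$rn <- ave(seq_along(foo$person), foo$person, FUN = seq_along)

print(foo)
```

This is vectorised over groups, so it avoids the row-by-row assignment and repeated gc() calls that make the loops in the question slow.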