How can I rank observations in-group faster?
I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
rank observations in group
I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how many times the person has been observed so far.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]
person.obs$n.obs = 0
# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
  print(unp[i])
  person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
    1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
  i=i+1
  gc()
}
# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) {
    pnr = person.obs[i,]$PERSNR
    e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
both give:
> foo
   person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2
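To try both options end to end, here is a minimal sketch. The table `foo` is rebuilt from the printed output and is only an assumption about the real data; the column names `rn1` and `rn2` are made up for the comparison. Since `rowid()` and `1:.N` number rows in whatever order they currently sit in, the sketch sorts by person and year first.

```r
library(data.table)

# Example data rebuilt from the printed output, deliberately unsorted
# (an assumption, not the asker's real data)
foo <- data.frame(
  person = c("pers1", "pers2", "pers1", "pers1", "pers2"),
  year   = c(2003, 2011, 1999, 2000, 1998)
)

setDT(foo)                   # convert to a data.table by reference
setorder(foo, person, year)  # the counters follow row order, so sort first

foo[, rn1 := rowid(person)]             # option 1
foo[, rn2 := seq_len(.N), by = person]  # option 2; seq_len(.N) gives the same as 1:.N here

print(foo)
```

Both columns should hold the same within-person counter, and neither needs an explicit loop over people or rows.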
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
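The difference between a running counter and a rank only shows up when a person has tied values. A small sketch with hypothetical data (`bar`, and the column names `counter` and `dense`, are made up for illustration) where pers1 appears twice in 1999:

```r
library(data.table)

# Hypothetical data with a repeated year for pers1, to show how ties are handled
bar <- data.table(
  person = c("pers1", "pers1", "pers1", "pers2"),
  year   = c(1999, 1999, 2003, 1998)
)

# Running counter: every row gets its own number within the person
bar[, counter := rowid(person)]

# Dense rank: tied years share a rank and no rank is skipped afterwards
bar[, dense := frank(year, ties.method = "dense"), by = person]

print(bar)
```

For pers1 the counter runs 1, 2, 3 while the dense rank is 1, 1, 2. If every person has at most one row per wave, the two coincide.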
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
both giving a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]

  person  year    rn
  (fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2
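If adding a package is not an option, base R can probably get you there with `ave()`. This is a sketch alongside the alternatives above, not part of them, and the data frame is again rebuilt from the printed output as an assumption.

```r
# Example data rebuilt from the printed output (an assumption)
foo <- data.frame(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 2000, 2003, 1998, 2011)
)

# Sort first, because the counter follows row order within each person
foo <- foo[order(foo$person, foo$year), ]

# ave() splits the row indices by person, applies seq_along() to each group,
# and puts the results back in the original row positions
foo$rn <- ave(seq_along(foo$person), foo$person, FUN = seq_along)

print(foo)
```

Like the data.table and dplyr versions, this is vectorised over groups, so it avoids the per-row subsetting and the `gc()` calls that make the loops slow.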