How can I rank observations in-group faster?


Problem description

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

rank observations in group

I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how many times the person has been observed so far.

I have the first two columns and want the third one:

person  wave   obs
pers1   1999   1
pers1   2000   2
pers1   2003   3
pers2   1998   1
pers2   2001   2

Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).

Thanks for any pointers!

# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]

person.obs$n.obs = 0

# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
   print(unp[i])
   person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs = 
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
    i=i+1
   gc()
}

# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
  e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}

Solution

A few alternatives with the data.table and dplyr packages.

data.table:

library(data.table)
# setDT(foo) is needed to convert to a data.table

# option 1:
setDT(foo)[, rn := rowid(person)]   

# option 2:
setDT(foo)[, rn := 1:.N, by = person]

both give:

> foo
   person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2
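
Both options number rows in the order they currently appear, so the data should be sorted by person and year first (the question does this with order()). A minimal sketch of that step, assuming a hypothetical, deliberately unsorted foo that is not from the original post:

```r
library(data.table)

# Hypothetical, deliberately unsorted sample data (an assumption for illustration)
foo <- data.frame(
  person = c("pers2", "pers1", "pers1", "pers2", "pers1"),
  year   = c(2011, 2003, 1999, 1998, 2000)
)

setDT(foo)                    # convert to a data.table by reference
setorder(foo, person, year)   # sort by person, then year, by reference
foo[, rn := rowid(person)]    # running counter 1, 2, 3, ... within each person
```

Without the setorder() call, rn would follow the original row order rather than the year of observation.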

If you want a true rank, you should use the frank function:

setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
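
The difference between the counter and a rank only shows up when a person has tied values. A small sketch, assuming hypothetical data in which pers1 has two rows for the same year (the column name rk is also just illustrative):

```r
library(data.table)

# Hypothetical data with a tie: pers1 appears twice in 1999
foo <- data.table(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 1999, 2003, 1998, 2011)
)

# Row counter: every row gets its own number, ties included
foo[, rn := rowid(person)]

# Dense rank: tied years share a rank, and no rank values are skipped
foo[, rk := frank(year, ties.method = "dense"), by = person]
```

Here rn counts the tied rows as 1 and 2, while rk gives both of them rank 1 and the 2003 row rank 2.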

dplyr:

library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())

both giving a similar result:

> foo
Source: local data frame [5 x 3]
Groups: person [2]

  person  year    rn
  (fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2
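
If neither package is available, the same per-group counter can probably be obtained in base R with ave(). This is not part of the original answer, only a sketch under the assumption that foo has the person and year columns shown above:

```r
# Hypothetical sample data mirroring the example above
foo <- data.frame(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 2000, 2003, 1998, 2011)
)

# Sort first so the counter follows year within person
foo <- foo[order(foo$person, foo$year), ]

# ave() splits the row positions by person and applies seq_along() to each
# group, which yields 1, 2, 3, ... per person without an explicit loop
foo$rn <- ave(seq_along(foo$person), foo$person, FUN = seq_along)
```

Like the data.table and dplyr versions, this works on whole groups at once instead of modifying the data frame row by row, which is what makes the loops in the question so slow.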
