Efficient R code for finding indices associated with unique values in vector
Problem description
Suppose I have the vector vec <- c("D","B","B","C","C").
My objective is to end up with a list of length length(unique(vec)), where element i of the list is a vector of the indices at which unique(vec)[i] occurs in vec.
For example, for vec this list would be:
exampleList <- list()
exampleList[[1]] <- c(1) #Since "D" is the first element
exampleList[[2]] <- c(2,3) #Since "B" is the 2nd/3rd element.
exampleList[[3]] <- c(4,5) #Since "C" is the 4th/5th element.
I tried the following approach but it's too slow. My example is large so I need faster code:
vec <- c("D","B","B","C","C")
uniques <- unique(vec)
exampleList <- lapply(seq_along(uniques), function(i) {
  which(vec == uniques[i])
})
exampleList
Update: The idiom DT[, list(list(.)), by=.] sometimes gave wrong results in R version >= 3.1.0. This is now fixed in commit #1280 in the current development version of data.table, v1.9.3. From NEWS:
DT[, list(list(.)), by=.] returns correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481.
Using data.table is about 15x faster than tapply:
library(data.table)
vec <- c("D","B","B","C","C")
dt = as.data.table(vec)[, list(list(.I)), by = vec]
dt
# vec V1
#1: D 1
#2: B 2,3
#3: C 4,5
# to get it in the desired format
# (perhaps in the future data.table's setnames will work for lists instead)
setattr(dt$V1, 'names', dt$vec)
dt$V1
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5
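For comparison, the same named list can be built in base R with split. This is a minimal sketch and not part of the original answer; setting the factor levels to unique(vec) is an assumption made here so that the groups come out in order of first appearance (D, B, C) rather than alphabetically. No claim is made about how its speed compares with the data.table version.

```r
vec <- c("D", "B", "B", "C", "C")

# Group the positions 1..length(vec) by the value found at each position.
# factor(..., levels = unique(vec)) keeps the groups in first-appearance order.
idx <- split(seq_along(vec), factor(vec, levels = unique(vec)))
idx
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5
```

Because the result is an ordinary named list, idx[["B"]] returns the positions of "B" directly.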
Speed tests:
vec = sample(letters, 1e7, TRUE)
system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
# user system elapsed
# 7.92 0.35 8.50
system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
# user system elapsed
# 0.39 0.09 0.49
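Before trusting a timing comparison, it can help to confirm on a small input that the approaches return the same indices. The sketch below uses only base R and compares the tapply expression from the timing above against the lapply/which approach from the question. The sample size, the seed, and the names by_which, by_tapply and agree are illustrative choices, not from the original post.

```r
set.seed(1)
vec <- sample(letters[1:5], 20, TRUE)
uniques <- unique(vec)

# Approach from the question: scan the whole vector once per unique value.
by_which <- lapply(uniques, function(u) which(vec == u))

# Expression used in the timing: group the positions by value, then reorder
# the groups to match the order of unique(vec).
by_tapply <- tapply(seq_along(vec), vec, identity)[uniques]

# tapply returns a list-like array that carries extra attributes,
# so compare the two results element by element.
agree <- vapply(seq_along(uniques),
                function(i) identical(by_tapply[[i]], by_which[[i]]),
                logical(1))
all(agree)
```

The same element-wise check could be applied to dt$V1 from the data.table version, assuming data.table is installed.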