快速从data.frame中删除零方差变量 [英] Quickly remove zero variance variables from a data.frame
问题描述
我有一个很大的data.frame,它是由我无法控制的进程生成的,它可能包含也可能不包含方差为零的变量(即所有观察值都相同)。我想根据这些数据建立一个预测模型,显然这些变量没有用。
I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
这是我当前正在使用的从中删除此类变量的函数data.frame。它当前基于 apply
,我想知道是否有任何明显的方法可以加快此函数的速度,以便它可以在非常大的数据集上快速工作,并且需要大量的数据( 400或500)变量?
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply
, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
A=factor(rep("X",10),levels=c('X','Y')),
B=round(runif(10)*10),
C=rep(10,10),
D=c(rep(10,9),1),
E=factor(rep("A",10)),
F=factor(rep(c("I","J"),5)),
G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
which(out==1)
}
这是该过程的结果:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
推荐答案
不要使用表()
-这种事情非常慢。一种选择是 length(unique(x))
:
Don't use table()
- very slow for such things. One option is length(unique(x))
:
foo <- function(dat) {
out <- lapply(dat, function(x) length(unique(x)))
want <- which(!out > 1)
unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
在示例数据集上,这比您的命令量级要快,同时给出相似的输出:
Which is an order magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
在此示例中,西蒙的解决方案同样非常快捷:
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
我们必须看看它们的规模是否与实际问题的规模相似。
but you'll have to see if they scale similarly to real problem sizes.
这篇关于快速从data.frame中删除零方差变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!