Alternative to a slower ifelse in an R data.table


Question

I am writing a function that uses multiple ifelse calls for data.table operations. Although I am using data.table for speed, the ifelse calls make my code slow, and the function is meant for large data sets. Hence I was wondering whether there is an alternative to ifelse. Here is one example from the function (there are close to 15 ifelse calls); in this example, flag is set to 1 if x is blank, else 0.

    dt <- dt[, flag := ifelse(is.na(x) | !nzchar(x), 1, 0)]

My apologies if this is a duplicate question.

Thanks in advance.

Solution

The fastest approach will probably depend on what your data looks like. Those mentioned in the comments are all comparable for this example:

(twice was suggested by @DavidArenburg and onceadd by @akrun. I'm not really sure how to benchmark these with replications > 1, since the objects are modified in place during the benchmark.)

library(data.table)

DT <- data.table(x=sample(c(NA,"",letters),1e8,replace=TRUE))

DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
DT5 <- copy(DT)
DT6 <- copy(DT)
DT7 <- copy(DT)

library(rbenchmark)
benchmark(
ifelse  = DT0[,flag:=ifelse(is.na(x)|!nzchar(x),1L,0L)],
keyit   = {
    setkey(DT1,x)   
    DT1[,flag:=0L]
    DT1[J(NA_character_,""),flag:=1L]
},
twiceby = DT2[, flag:= 0L][is.na(x)|!nzchar(x), flag:= 1L,by=x],
twice   = DT3[, flag:= 0L][is.na(x)|!nzchar(x), flag:= 1L],
onceby  = DT4[, flag:= +(is.na(x)|!nzchar(x)), by=x],
once    = DT5[, flag:= +(is.na(x)|!nzchar(x))],
onceadd = DT6[, flag:= (is.na(x)|!nzchar(x))+0L],
oncebyk = {setkey(DT7,x); DT7[, flag:= +(is.na(x)|!nzchar(x)), by=x]},
replications=1
)[1:5]
#      test replications elapsed relative user.self
# 1  ifelse            1   19.61   31.127     17.32
# 2   keyit            1    0.63    1.000      0.47
# 6    once            1    3.26    5.175      2.68
# 7 onceadd            1    3.24    5.143      2.88
# 5  onceby            1    1.81    2.873      1.75
# 8 oncebyk            1    0.91    1.444      0.82
# 4   twice            1    3.17    5.032      2.79
# 3 twiceby            1    3.45    5.476      3.16
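
Before comparing timings, it may be worth checking that the variants really produce the same flag. The following is a minimal sketch on a six-row table; the column names flag_ifelse, flag_once, flag_onceadd and flag_twice are made up for this illustration and do not appear in the benchmark above.

```r
library(data.table)

# Small example table with NA, empty strings and ordinary values.
DT <- data.table(x = c("a", NA, "", "b", "", "z"))

# Reference result using ifelse.
DT[, flag_ifelse := ifelse(is.na(x) | !nzchar(x), 1L, 0L)]

# "once": unary plus coerces the logical vector to integer.
DT[, flag_once := +(is.na(x) | !nzchar(x))]

# "onceadd": adding 0L coerces the logical vector to integer.
DT[, flag_onceadd := (is.na(x) | !nzchar(x)) + 0L]

# "twice": set everything to 0L, then overwrite the matching rows.
DT[, flag_twice := 0L][is.na(x) | !nzchar(x), flag_twice := 1L]

# All four columns should hold the same integer flags.
all_equal <- identical(DT$flag_ifelse, DT$flag_once) &&
  identical(DT$flag_ifelse, DT$flag_onceadd) &&
  identical(DT$flag_ifelse, DT$flag_twice)
print(all_equal)
```

Note that nzchar() returns TRUE for NA by default, which is why the is.na(x) test is needed alongside it.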

Discussion. In this example, keyit is the fastest (0.63 s elapsed, versus 19.61 s for ifelse). However, it is also the most verbose, and it changes the sort order of your table. Also, keyit is very specific to the OP's question (it takes advantage of the fact that exactly two character values, NA and "", satisfy the condition is.na(x)|!nzchar(x)), so it might not work as well for other applications, where it would need to be written something like

keyit   = {
    setkey(DT1,x)
    flagem = DT1[,some_other_condition(x),by=x][(V1)]$x
    DT1[,flag:=0L]
    DT1[J(flagem),flag:=1L]
}
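
To make the generalized pattern concrete, here is a small sketch with a stand-in condition. The function is_vowel is hypothetical and exists only for this illustration; any vectorized function that returns TRUE or FALSE for each value could take its place.

```r
library(data.table)

# Hypothetical condition, for illustration only: flag the vowels.
is_vowel <- function(x) x %in% c("a", "e", "i", "o", "u")

DT <- data.table(x = c("a", "b", NA, "e", "", "b", "a"))

# Sort DT by x and mark x as the key (this reorders the rows).
setkey(DT, x)

# Evaluate the condition once per distinct value of x,
# then keep the values for which it is TRUE.
flagem <- DT[, is_vowel(x), by = x][(V1)]$x

# Default everything to 0L, then update the matching rows via a keyed join.
DT[, flag := 0L]
DT[J(flagem), flag := 1L]

print(DT)
```

The condition is evaluated once per distinct value rather than once per row, which is presumably where the approach gains its speed when x has few distinct values. If the original row order matters, one option is to store a row index column before calling setkey and sort by it afterwards.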
