优化数据框中的替换 [英] Optimizing replacement in a data frame

查看:130
本文介绍了优化数据框中的替换的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是的更新对基于名称中的模式的列。因此,这部分是出于好奇,部分是为了娱乐。

在回答这个问题时,我想到这可能是其中一个 for 循环比 * apply 函数更有效率(我一直在寻找一个很好的例子那么 * apply 不一定比为循环构造良好的更高效。所以我想再次提出这个问题,并询问是否有人能够使用 * apply 函数(或 purr 如果这是你的东西)比我下面写的循环的更好。性能将根据笔记本电脑上的 microbenchmark 来评估执行时间(一个运行R 3.3.2的廉价Windows操作系统)。

data.table dplyr 建议也是受欢迎的。 (我已经制定了计划,我将在保存的所有微秒内完成)



挑战



考虑数据帧:

  col_1 < -  c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <-c(1,23423,NA,23)
temp_col_2< -c(1,2, 23,4,5)

df_test< - data.frame(col_1,temp_col_1,col_2,temp_col_2)
set.seed(pi)
df_test< - df_test [示例(1:nrow(df_test),1000,replace = TRUE),]

c $ c> col_x ,用 temp_col_x 中的相应值替换缺失的值。所以,例如:

  col_1 temp_col_1 col_2 temp_col_2 
1 1 12 1 1
2 2 2 23 2
3不适用2 423 23
4 4 3不适用4
5 5 4 23 5

变为

  col_1 temp_col_1 col_2 temp_col_2 
1 1 12 1 1
2 2 2 23 2
3 2 2 423 23
4 4 3 4 4
5 5 4 23 5



现有的解决方案



p>

  temp_cols<  - 名称(df_test)[grepl(^ temp,names(df_test))] 
cols< ;( - )()()()()()()()()()() [i]]]))
df_test [[cols [i]]] [row_to_replace]& lt; - df_test [[temp_cols [i]]] [row_to_replace]
}

我的

  lapply(names(df_test) (grep(^ temp_,names(df_test))],
function(tc){
col < - sub(^ temp_,,tc)
row_to_replace< ; - (which.na(df_test [[col]]))
df_test [[col]] [row_to_replace]< - df_test [[tc]] [row_to_replace]
})


$ b

标杆对比



我将开始在这个问题的编辑中显示基准。 (编辑:代码现在是弗兰克答案的副本,但在我的机器上运行100次,如承诺)
$ b $ pre $库$库)
library(data.table)
library(microbenchmark)
set.seed(pi)

nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10,NA_integer_),nc * nr,replace = TRUE)%>%矩阵(nr,nc)%>%data.frame
df_r = sample :10),nc * nr,replace = TRUE)%>%矩阵(nr,nc)%>%data.frame


microbenchmark(times = 100,
for_vec = {
df_m <-df_m0
for(col in 1:nc){
w < - which(is.na(df_m [[col]]))
df_m [[col]] [w] <-df_r [[col]] [w]
}
},lapply_vec = {
df_m < - df_m0
lapply (seq_along(df_m),
function(i){
w< - which(is.na(df_m [[i]]))
d f_m [] [w] <-df_r [[i]] [w]
})

},for_df = {
df_m < df_m0
for(col in 1:nc){
w < - which(is.na(df_m [[col]]))
df_m [w,col] < - df_r [ (

$),lapply_df = {
df_m < - df_m0
lapply(seq_along(df_m),
function(i){$ b $ (df_m [[i]]))
df_m [w,i] <-df_r [w,i]
})
} ,mat = {#in lmo's answer
df_m < - df_m0
bah = is.na(df_m)
df_m [bah] = df_r [bah]
},set = {
df_m< - copy(df_m0)
for(col in 1:nc){
w = which(is.na(df_m [[col]]))
set(df_m,i = w,j = col,v = df_r [w,col])
}
}

$ b $ p





$ b $ lq平均值uq max neval cld
for_vec 135.83875 157.84548 175.23005 166.60090 176.81839 502.0616 100 b
lapply_vec 135.67322 158.99496 179.53474 165.11883 178.06968 551.7709 100 b
for_df 173.95971 204.16368 222.30677 212.76608 224.78188 446.6050 100 c
lapply_df 181.46248 205.57069 220.38911 215.08505 223.98406 381.1006 100 c
垫129.27835 154.01248 173.11378 159.83070 169.67439 453.0888 100 b
套66.86402 81.08138 86.32626 85.51029 89.58331 123.1926 100 a
set 函数来修改data.tables或者 data.frames通过引用。

下面是一个比较灵活的关于cols和rows数的基准,并且避开了OP中令人尴尬的列名称:$ b $ (c(1:10,NA_integer_))b

  library(magrittr)
nc = 1e3
nr = 1e2
df_m0 = ),nc * nr,replace = TRUE)%>%矩阵(nr,nc)%>%data.frame
df_r = sample(c(1:10),nc * nr,replace = TRUE) %>%matrix(nr,nc)%>%data.frame

library(data.table)
library(microbenchmark)
microbenchmark(times = 10,
for_vec = {
df_m <-df_m0
for(col in 1:nc){
w < - which(is.na(df_m [[col]]))
df_m [[col]] [w]< - df_r [[col]] [w]
}
},lapply_vec = {
df_m < - df_m0 $ b $ (b),(b),(b),(b),(b)和(b) ; - (df_r [[i]] [w]
})
},for_df = {
df_m < - df_m0
(col in 1:nc){
w&l t; - (哪个是(df_m [[col]]))
df_m [w,col] <-df_r [w,col]
}
},lapply_df = {
df_m < - df_m0
lapply(seq_along(df_m),function(i){
w< - which(is.na(df_m [[i]]))
df_m [w,i] <-df_r [w,i]
})
},mat = {#lmo的答案
df_m < - df_m0
bah =(。)(df_m)
df_m [bah] = df_r [bah]
},set = {
df_m < - copy(df_m0)
:nc){
w = which(is.na(df_m [[col]]))
set(df_m,i = w,j = col,v = df_r [w,col])




$ b $ p

...

 单位:毫秒
expr min lq平均值uq max neval
for_vec 77.06501 89.53430 100.10051 96.33764 106.13486 142.1329 10
lapply_vec 77.67366 89.04438 98.81510 99.08863 108.86491 117.2956 10
for_df 103.79097 130.33134 140.95398 1 44.46526 157.11335 161.4507 10
lapply_df 97.04616 114.17825 126.10633 131.20382 137.64375 149.7765 10
垫73.47691 84.51473 100.16745 103.44476 112.58006 128.6166 10
set 44.32578 49.58586 62.52712 56.30460 71.63432 101.3517 10





  • 如果我们调整 nc nr 或者 NA s的频率这四个选项可能会改变。我猜这里面越多越好, mat (从@ lmo的答案)和 set 方式看起来越好。


  • set 测试中的 copy 因为 set 函数只是通过引用修改表(不同于其他选项,我认为),所以需要额外的时间。


  • This is an extension of Update pairs of columns based on pattern in their names . Thus, this is partially motivated by curiosity and partially for entertainment.

    While developing an answer to that question, it occurred to me that this may be one of those cases where a for loop is more efficient than an *apply function (and I've been looking for a good illustration of the fact that *apply is not necessarily "more efficient" than a well constructed for loop). So I'd like to pose the question again, and ask if anyone is able to write a solution using an *apply function (or purr if that's your thing) that performs better than the for loop I've written below. Performance will be judged on execution time as evaluated via microbenchmark on my laptop (A cheap Windows box running R 3.3.2).

    data.table and dplyr suggestions are welcome as well. (I'm already making plans for what I'll do with all the microseconds I save).

    The Challenge

    Consider the data frame:

    col_1 <- c(1,2,NA,4,5)
    temp_col_1 <-c(12,2,2,3,4)
    col_2 <- c(1,23,423,NA,23)
    temp_col_2 <-c(1,2,23,4,5)
    
    df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2) 
    set.seed(pi)
    df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]
    

    For each col_x, replace the missing values with the corresponding value in temp_col_x. So, for example:

      col_1 temp_col_1 col_2 temp_col_2
    1     1         12     1          1
    2     2          2    23          2
    3    NA          2   423         23
    4     4          3    NA          4
    5     5          4    23          5
    

    becomes

      col_1 temp_col_1 col_2 temp_col_2
    1     1         12     1          1
    2     2          2    23          2
    3     2          2   423         23
    4     4          3     4          4
    5     5          4    23          5
    

    Existing Solutions

    The for loop I've already written

    temp_cols <- names(df_test)[grepl("^temp", names(df_test))]
    cols <- sub("^temp_", "", temp_cols)
    
    for (i in seq_along(temp_cols)){
      row_to_replace <- which(is.na(df_test[[cols[i]]]))
      df_test[[cols[i]]][row_to_replace] <- df_test[[temp_cols[i]]][row_to_replace]
     }
    

    My best apply function so far is:

    lapply(names(df_test)[grepl("^temp_", names(df_test))],
           function(tc){
             col <- sub("^temp_", "", tc)
             row_to_replace <- which(is.na(df_test[[col]]))
             df_test[[col]][row_to_replace] <<- df_test[[tc]][row_to_replace]
           })
    

    Benchmarking

    As (if) suggestions come in, I will begin showing benchmarks in edits to this question. (edit: code is now a copy of Frank's answer, but run 100 times on my machine, as promised)

    library(magrittr)
    library(data.table)
    library(microbenchmark)
    set.seed(pi)
    
    nc = 1e3
    nr = 1e2
    df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
    df_r  = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
    
    
    microbenchmark(times = 100,
                   for_vec = {
                     df_m <- df_m0
                     for (col in 1:nc){
                       w <- which(is.na(df_m[[col]]))
                       df_m[[col]][w] <- df_r[[col]][w]
                     }
                   }, lapply_vec = {
                     df_m <- df_m0
                     lapply(seq_along(df_m),
                            function(i){
                              w <- which(is.na(df_m[[i]]))
                              df_m[[i]][w] <<- df_r[[i]][w]
                            })
    
                   }, for_df = {
                     df_m <- df_m0
                     for (col in 1:nc){
                       w <- which(is.na(df_m[[col]]))
                       df_m[w, col] <- df_r[w, col]
                     }
                   }, lapply_df = {
                     df_m <- df_m0
                     lapply(seq_along(df_m),
                            function(i){
                              w <- which(is.na(df_m[[i]]))
                              df_m[w, i] <<- df_r[w, i]
                            })
                   }, mat = { # in lmo's answer
                     df_m <- df_m0
                     bah = is.na(df_m)
                     df_m[bah] = df_r[bah]
                   }, set = {
                     df_m <- copy(df_m0)
                     for (col in 1:nc){
                       w = which(is.na(df_m[[col]]))
                       set(df_m, i = w, j = col, v = df_r[w, col])
                     }
                   }
    )
    

    Results:

    Unit: milliseconds
           expr       min        lq      mean    median        uq      max neval cld
        for_vec 135.83875 157.84548 175.23005 166.60090 176.81839 502.0616   100  b 
     lapply_vec 135.67322 158.99496 179.53474 165.11883 178.06968 551.7709   100  b 
         for_df 173.95971 204.16368 222.30677 212.76608 224.78188 446.6050   100   c
      lapply_df 181.46248 205.57069 220.38911 215.08505 223.98406 381.1006   100   c
            mat 129.27835 154.01248 173.11378 159.83070 169.67439 453.0888   100  b 
            set  66.86402  81.08138  86.32626  85.51029  89.58331 123.1926   100 a  
    

    解决方案

    Data.table provides the set function to modify data.tables or data.frames by reference.

    Here's a benchmark that is more flexible with respect to numbers of cols and rows and that sidesteps the awkward column-name stuff in the OP:

    library(magrittr)
    nc = 1e3
    nr = 1e2
    df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
    df_r  = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
    
    library(data.table)
    library(microbenchmark)
    microbenchmark(times = 10,
      for_vec = {
        df_m <- df_m0
        for (col in 1:nc){
          w <- which(is.na(df_m[[col]]))
          df_m[[col]][w] <- df_r[[col]][w]
        }
        }, lapply_vec = {
        df_m <- df_m0
        lapply(seq_along(df_m), function(i){
              w <- which(is.na(df_m[[i]]))
              df_m[[i]][w] <<- df_r[[i]][w]
        })
      }, for_df = {
        df_m <- df_m0
        for (col in 1:nc){
          w <- which(is.na(df_m[[col]]))
          df_m[w, col] <- df_r[w, col]
        }
        }, lapply_df = {
        df_m <- df_m0
        lapply(seq_along(df_m), function(i){
              w <- which(is.na(df_m[[i]]))
              df_m[w, i] <<- df_r[w, i]
        })
      }, mat = { # in lmo's answer
        df_m <- df_m0
        bah = is.na(df_m)
        df_m[bah] = df_r[bah]
      }, set = {
        df_m <- copy(df_m0)
        for (col in 1:nc){
          w = which(is.na(df_m[[col]]))
          set(df_m, i = w, j = col, v = df_r[w, col])
        }
      }
    )
    

    Which gives...

    Unit: milliseconds
           expr       min        lq      mean    median        uq      max neval
        for_vec  77.06501  89.53430 100.10051  96.33764 106.13486 142.1329    10
     lapply_vec  77.67366  89.04438  98.81510  99.08863 108.86491 117.2956    10
         for_df 103.79097 130.33134 140.95398 144.46526 157.11335 161.4507    10
      lapply_df  97.04616 114.17825 126.10633 131.20382 137.64375 149.7765    10
            mat  73.47691  84.51473 100.16745 103.44476 112.58006 128.6166    10
            set  44.32578  49.58586  62.52712  56.30460  71.63432 101.3517    10
    

    Comments:

    • If we adjust nc and nr or the frequency of NAs, the ranking of these four options might change. I guess the more cols there are, the better the mat way (from @lmo's answer) and set way look.

    • The copy in the set test takes some extra time beyond what we'd see in practice, since the set function just modifies the table by reference (unlike the other options, I think).

    这篇关于优化数据框中的替换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆