How to efficiently mutate multiple columns of a large dataframe
Question
I would appreciate any help to efficiently apply my function to multiple columns of my large data frame DT_large.
My function works well and efficiently when I apply it with dplyr::mutate_at() to a small data frame DT_small. However, when applied to a relatively large dataset DT_large, available here, it takes dplyr::mutate_at() several hours to deliver the desired output.
It might be that there is some mistake in my code that is making dplyr::mutate_at() less efficient with my relatively large dataset. Alternatively, it might be that dplyr::mutate_at() is not efficient with relatively large datasets like mine.
In either case, I would appreciate any help to solve my problem, that is, a faster way to correctly apply my function to DT_large and deliver the desired output, like the one I get when I apply it to DT_small.
# small dataset
DT_small<-structure(list(.id = 1:10, `_E1.1` = c(0.475036902, 0.680123015,
0.896920608, 0.329908621, 0.652288128, 0.408813318, 0.486444822,
0.429333778, 2.643293032, 0.782194143), `_E1.2` = c(79.22653114,
0.680123015, 4.088529776, 0.232076989, 0.652288128, 0.329908621,
0.486444822, 0.429333778, 2.643293032, 0.963554482), `_E1.3` = c(0.466755502,
0.680123015, 0.461887024, 1.236938197, 0.652288128, 0.408813318,
0.486444822, 0.429333778, 2.643293032, 0.95778584), `_E1.4` = c(1.608298119,
0.680123015, 0.578464999, 0.317125521, 0.652288128, 0.408813318,
0.486444822, 0.429333778, 2.643293032, 2.125841957), `_E1.5` = c(0.438424932,
0.680123015, 0.896920608, 0.366118007, 0.652288128, 1.007079029,
0.486444822, 0.429333778, 2.643293032, 0.634134022), `_E10.1` = c(0.45697607,
0.647681721, 1.143509029, 0.435735621, 0.49400961, 0.501421816,
0.461123723, 0.568477247, 1.756598213, 0.67895017), `_E10.2` = c(35.30312978,
0.647681721, 2.58357783, 0.25514789, 0.49400961, 0.435735621,
0.461123723, 0.568477247, 1.756598213, 0.776970116), `_E10.3` = c(0.79477661,
0.647681721, 0.672430959, 0.886991224, 0.49400961, 0.501421816,
0.461123723, 0.568477247, 1.756598213, 1.019701072), `_E10.4` = c(1.912254794,
0.647681721, 0.840757508, 0.414669983, 0.49400961, 0.501421816,
0.461123723, 0.568477247, 1.756598213, 1.576577576), `_E10.5` = c(0.429335115,
0.647681721, 1.143509029, 0.336512868, 0.49400961, 0.82434125,
0.461123723, 0.568477247, 1.756598213, 0.639407175), `_E100.1` = c(0.567579678,
0.780423094, 1.739967261, 0.282217304, 0.784904687, 0.319146371,
0.585056235, 0.596494912, 3.545358563, 0.899595619)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
# large dataset
# 1) download DT_large.csv to your directory from https://jmp.sh/iC6WOzw
# 2) DT_large <- read_csv("DT_large.csv")
# my function
my_dataset$new_variable <- ifelse(
  my_dataset$old_variable > quantile(my_dataset$old_variable, probs = 0.80),
  quantile(my_dataset$old_variable, probs = 0.80),
  my_dataset$old_variable)
# my function applied to my small dataset
# this perfectly delivers the desired output in seconds
DT_small %>% mutate_at(vars(matches("_E")),
  funs(ifelse(
    DT_small$. > quantile(DT_small$., probs = 0.80),
    quantile(DT_small$., probs = 0.80),
    DT_small$.)))
# my function applied to my large dataset
# this takes several hours to deliver the desired output
DT_large %>% mutate_at(vars(matches("_E")),
  funs(ifelse(
    DT_large$. > quantile(DT_large$., probs = 0.80),
    quantile(DT_large$., probs = 0.80),
    DT_large$.)))
Thanks in advance for your help.
Answer
You can get a pretty big speedup by: 1.) computing the quantile once, and 2.) applying your new more parsimonious function to the columns.
On my machine this approach is about 15x faster.
library(dplyr)
library(microbenchmark)
dplyr_res <- DT_small %>% mutate_at(vars(matches("_E")),
funs(ifelse(
DT_small$.>quantile(
DT_small$.,probs=0.80),quantile(
DT_small$.,probs=0.80),DT_small$.)))
fun_col <- function(col) {
m <- quantile(col, .8) # compute once
ifelse(col > m, m, col)
}
sapply_res <- sapply(DT_small[,2:ncol(DT_small)], fun_col)
dplyr_res %>% dplyr::select(-.id) == sapply_res
#> _E1.1 _E1.2 _E1.3 _E1.4 _E1.5 _E10.1 _E10.2 _E10.3 _E10.4 _E10.5
#> [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> _E100.1
#> [1,] TRUE
#> [2,] TRUE
#> [3,] TRUE
#> [4,] TRUE
#> [5,] TRUE
#> [6,] TRUE
#> [7,] TRUE
#> [8,] TRUE
#> [9,] TRUE
#> [10,] TRUE
microbenchmark(dplyr_res = DT_small %>% mutate_at(vars(matches("_E")),
funs(ifelse(
DT_small$.>quantile(
DT_small$.,probs=0.50),quantile(
DT_small$.,probs=0.50),DT_small$.))),
sapply_res = sapply(DT_small[,2:ncol(DT_small)], fun_col))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> dplyr_res 12.372519 12.668833 13.577804 12.856150 13.553805 60.220232
#> sapply_res 1.808413 1.850595 1.966174 1.874696 1.911037 3.441024
#> neval cld
#> 100 b
#> 100 a
Saving on recomputing is probably doing a bunch of the work here. I didn't explicitly test to see if sapply() is faster than mutate_at.
parallel::mcmapply(fun_col, DT_small %>% select(-.id))
Depends on having the parallel package installed.
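Since the data is already a data.table, another option worth trying (not benchmarked here, so treat it as a sketch) is to update the "_E" columns by reference, combining the precomputed quantile with the vectorized pmin() instead of ifelse():

```r
library(data.table)

# cap each column at its own 80th percentile; pmin() is vectorized and
# avoids the per-element branching of ifelse()
cap_at_q80 <- function(col) pmin(col, quantile(col, probs = 0.80, names = FALSE))

# toy stand-in for DT_large (same shape: an .id column plus "_E" columns)
DT <- data.table(.id   = 1:5,
                 `_E1.1` = c(1, 2, 3, 4, 100),
                 `_E1.2` = c(10, 20, 30, 40, 50))

cols <- grep("_E", names(DT), value = TRUE)
DT[, (cols) := lapply(.SD, cap_at_q80), .SDcols = cols]  # modifies DT in place
```

Because `:=` assigns by reference, this avoids copying the whole frame, which can matter for a dataset of this size.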