使用R中的向量化提高循环效率 [英] Increase for loop efficiency using vectorization in R

查看:90
本文介绍了使用R中的向量化提高循环效率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我仍然是R的一个相对较新的用户,但是我试图对我的for循环进行矢量化处理以提高计算速度.当前,我需要在data.frame/data.table df上执行两个子集(仅针对此帖子进行命名):1)按组id的子集和2)对于每个特定时间间隔interval的子集id.我设置了一个嵌套的for循环,因为我需要使用lawstat包中的levene.test在数据的每个子集和控件之间执行均方差测试.该测试不适用于多个组,因为它报告单个p.value,因此该子集是必需的.您还将注意到,我的原始代码(请参见下文)将每个测试的p.value记录到一个.csv文件中.

具有21,840个子集(idinterval的唯一组合),循环完成了354.65秒.

library(lawstat)
library(data.table)

df <- as.data.table(df)
control <- as.data.table(control)

id_name <- as.character(unique(df$id)) 

system.time(for(t in 1:length(id_name)) 
{
  sub <- df[id == id_name[t],] 
  intervals <- unique(sub$interval) 
  for(i in 1:length(intervals)) 
  {
    sub_int <- sub[interval == intervals[i],] 
    compare <- rbindlist(list(control, sub_int)) 
    p_value <- with(compare, levene.test(elevation, id, location = "median"))$p.value 
    write.table(p_value, file = "C:/Users/Guest/Desktop/example.csv", 
                row.names=FALSE, append=TRUE, col.names=FALSE, sep = ",")
  }
}
)

 user  system elapsed 
 327.17    8.05  354.65 

在阅读了有关加速for循环的一些条目后(即帖子#1 帖子#2 帖子#3 ),我尝试通过使用plyr::dlply拆分数据框并将每个p.value添加到预配置的矩阵中来减少原始代码.尽管我做了很多尝试来减少代码并增加其流量,但最终还是使我的整体处理时间增加了2369.33秒!这绝对是朝错误方向迈出的一步...

library(plyr)
library(lawstat)
library(data.table)

df <- as.data.table(df)
control <- as.data.table(control)

df <- dlply(df, .(id, interval))
N <- length(df)

p_value <- matrix(nrow = N, ncol = 1)

system.time(for(t in 1:N) 
{
  compare <- rbindlist(list(control, rbindlist(df[t]))) 
  p_value[t,] <- with(compare, levene.test(elevation, id, location = "median"))$p.value 
}
)

user  system elapsed 
 511.62    10.92  2723.98

不包括上述失败,有人会建议提高我的原始代码效率,甚至可能不使用for循环吗?请参见下面的dfcontrol数据框的示例.

谢谢您的时间

> dput(df)
structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), interval = structure(c(15889, 15889, 
15889, 15889, 15889, 15890, 15890, 15890, 15890, 15890, 15891, 
15891, 15891, 15891, 15891, 15889, 15889, 15889, 15889, 15889, 
15889, 15890, 15890, 15890, 15890, 15890, 15891, 15891, 15891, 
15891, 15891, 15892, 15892, 15892, 15892, 15892, 15893, 15893, 
15893, 15893, 15893, 15889, 15889, 15889, 15889, 15889, 15890, 
15890, 15890, 15890, 15891, 15891, 15891, 15891, 15891), class = "Date"), 
    elevation = c(2.725702647, 4.045702647, 7.115702647, 14.1449616, 
    45.81372264, 46.68364133, 46.70361998, 46.7543655, 46.76464597, 
    46.45109248, 47.54507271, 47.52588714, 47.51344638, 47.50128855, 
    47.48358824, -1.670693593, 4.920447963, 8.050447963, 11.95044796, 
    13.75044796, 15.05044796, 50.47345986, 50.46553336, 50.81676644, 
    50.4017478, 50.41152961, 48.58406823, 48.5605739, 48.53763191, 
    48.49345173, 48.47223596, 48.13627925, 48.54543553, 48.54543553, 
    48.53521986, 48.07489144, 48.60564932, 48.57280867, 47.67089291, 
    48.04511639, 48.44131081, -1.169844107, 7.203976499, 12.9339765, 
    17.3339765, 21.7339765, 50.92169461, 50.90168633, 50.85610537, 
    50.81676644, 50.34202032, 50.31834457, 50.31834457, 50.30286461, 
    50.28010937)), .Names = c("id", "interval", "elevation"), row.names = c(NA, 
-55L), class = "data.frame")

> dput(control)
structure(list(id = c("control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control"
), interval = structure(c(15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938), class = "Date"), 
    elevation = c(49.6267375548687, 49.7179641569923, 49.6281678554157, 
    49.2966226173808, 49.4024006688116, 49.5300452221467, 49.4827895052902, 
    49.3035612465632, 49.3341529784207, 49.3618015298912, 49.4158423052967, 
    49.4362086642701, 49.4497426281189, 49.4601318688608, 49.4673666542916, 
    49.4714219838343, 49.906148364131, 49.4583449842877, 49.3643007797062, 
    49.6443051181373, 49.5538302858817, 49.5089656382576, 49.4618701828363, 
    49.4126218440506, 49.5408428265233, 49.5505586622851, 49.4670486774167, 
    49.3681735220153, 49.5053923461751, 49.8626607312497, 49.7899533661675, 
    49.7318954434761, 49.6740089160738, 49.7880048774961, 49.7467682220414, 
    49.678922094158, 49.6255140230552, 49.8004256162408, 51.9533062877202, 
    49.5878680391245)), .Names = c("id", "interval", "elevation"
), row.names = c(NA, -40L), class = "data.frame")

解决方案

我收集到您所说的df必须是data.table;否则您的代码将无法运行.

library(data.table)
library(lawstat)

setDT(df)
setDT(control)
get.pvalue <- function(elevation) {
  compare <- data.table(id=rep(c("T","C"),c(length(elevation),nrow(control))),
                   elevation=c(elevation,control$elevation))
  with(compare,levene.test(elevation,id)$p.value)
}
result <- DT[,get.pvalue(elevation),by=c("id","interval")]

这将产生与代码相同的p.value,但收集在result data.table中.几乎可以肯定,使用单个write.csv(...)进行操作会更快.

这是矢量化的,但是仅比您的for(...)循环(删除了对write.csv(...)的调用)快约20%.

希望其他人有更好的主意.

I am still a relatively new user to R but am attempting to vectorize my for loops in order to increase computational speed. Currently, I need to perform two subsets on the data.frame/ data.table df (only named for this post): 1) subset by group id and 2) subset by each time interval interval for that particular id. I setup a nested for loop because I need to perform a homogeneity of variance test between each subset of data and a control using the levene.test in the lawstat package. The test does not work well with multiple groups as it reports a single p.value so the subset is necessary. You will also notice that my original code (see below) records the p.value for each test to a .csv file row-by-row.

With 21,840 subsets (unique combinations of id and interval), the loop took 354.65 seconds to complete.

library(lawstat)
library(data.table)

df <- as.data.table(df)
control <- as.data.table(control)

id_name <- as.character(unique(df$id)) 

system.time(for(t in 1:length(id_name)) 
{
  sub <- df[id == id_name[t],] 
  intervals <- unique(sub$interval) 
  for(i in 1:length(intervals)) 
  {
    sub_int <- sub[interval == intervals[i],] 
    compare <- rbindlist(list(control, sub_int)) 
    p_value <- with(compare, levene.test(elevation, id, location = "median"))$p.value 
    write.table(p_value, file = "C:/Users/Guest/Desktop/example.csv", 
                row.names=FALSE, append=TRUE, col.names=FALSE, sep = ",")
  }
}
)

 user  system elapsed 
 327.17    8.05  354.65 

After reading some entries in regards to speeding up for loops (i.e. Post #1, Post#2, Post #3), I attempted to reduce my original code by splitting my dataframe using plyr::dlply and adding the each p.value to a pre-configured matrix. Despite my many attempts to reduce my code and increase its flow, I ended up increasing my overall processing time by 2369.33 seconds! This is definitely a step in the wrong direction...

library(plyr)
library(lawstat)
library(data.table)

df <- as.data.table(df)
control <- as.data.table(control)

df <- dlply(df, .(id, interval))
N <- length(df)

p_value <- matrix(nrow = N, ncol = 1)

system.time(for(t in 1:N) 
{
  compare <- rbindlist(list(control, rbindlist(df[t]))) 
  p_value[t,] <- with(compare, levene.test(elevation, id, location = "median"))$p.value 
}
)

user  system elapsed 
 511.62    10.92  2723.98

Not including my above failure, would anyone have suggestions for making my original code more efficient, possibly not even using a for loop? Please see an example of the df and control data frame below.

Thank you for your time

> dput(df)
structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), interval = structure(c(15889, 15889, 
15889, 15889, 15889, 15890, 15890, 15890, 15890, 15890, 15891, 
15891, 15891, 15891, 15891, 15889, 15889, 15889, 15889, 15889, 
15889, 15890, 15890, 15890, 15890, 15890, 15891, 15891, 15891, 
15891, 15891, 15892, 15892, 15892, 15892, 15892, 15893, 15893, 
15893, 15893, 15893, 15889, 15889, 15889, 15889, 15889, 15890, 
15890, 15890, 15890, 15891, 15891, 15891, 15891, 15891), class = "Date"), 
    elevation = c(2.725702647, 4.045702647, 7.115702647, 14.1449616, 
    45.81372264, 46.68364133, 46.70361998, 46.7543655, 46.76464597, 
    46.45109248, 47.54507271, 47.52588714, 47.51344638, 47.50128855, 
    47.48358824, -1.670693593, 4.920447963, 8.050447963, 11.95044796, 
    13.75044796, 15.05044796, 50.47345986, 50.46553336, 50.81676644, 
    50.4017478, 50.41152961, 48.58406823, 48.5605739, 48.53763191, 
    48.49345173, 48.47223596, 48.13627925, 48.54543553, 48.54543553, 
    48.53521986, 48.07489144, 48.60564932, 48.57280867, 47.67089291, 
    48.04511639, 48.44131081, -1.169844107, 7.203976499, 12.9339765, 
    17.3339765, 21.7339765, 50.92169461, 50.90168633, 50.85610537, 
    50.81676644, 50.34202032, 50.31834457, 50.31834457, 50.30286461, 
    50.28010937)), .Names = c("id", "interval", "elevation"), row.names = c(NA, 
-55L), class = "data.frame")

> dput(control)
structure(list(id = c("control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control"
), interval = structure(c(15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938, 
15938, 15938, 15938, 15938, 15938, 15938, 15938, 15938), class = "Date"), 
    elevation = c(49.6267375548687, 49.7179641569923, 49.6281678554157, 
    49.2966226173808, 49.4024006688116, 49.5300452221467, 49.4827895052902, 
    49.3035612465632, 49.3341529784207, 49.3618015298912, 49.4158423052967, 
    49.4362086642701, 49.4497426281189, 49.4601318688608, 49.4673666542916, 
    49.4714219838343, 49.906148364131, 49.4583449842877, 49.3643007797062, 
    49.6443051181373, 49.5538302858817, 49.5089656382576, 49.4618701828363, 
    49.4126218440506, 49.5408428265233, 49.5505586622851, 49.4670486774167, 
    49.3681735220153, 49.5053923461751, 49.8626607312497, 49.7899533661675, 
    49.7318954434761, 49.6740089160738, 49.7880048774961, 49.7467682220414, 
    49.678922094158, 49.6255140230552, 49.8004256162408, 51.9533062877202, 
    49.5878680391245)), .Names = c("id", "interval", "elevation"
), row.names = c(NA, -40L), class = "data.frame")

解决方案

I gather that what you're calling df must be a data.table; otherwise your code doesn't run.

library(data.table)
library(lawstat)

setDT(df)
setDT(control)
get.pvalue <- function(elevation) {
  compare <- data.table(id=rep(c("T","C"),c(length(elevation),nrow(control))),
                   elevation=c(elevation,control$elevation))
  with(compare,levene.test(elevation,id)$p.value)
}
result <- DT[,get.pvalue(elevation),by=c("id","interval")]

This produces the same p.values as your code, but collected in a result data.table. It will almost certainly be faster to do a single write.csv(...) using that.

This is vectorized, but it's only about 20% faster than your for(...) loop (with the calls to write.csv(...) removed).

Hopefully, someone else has a better idea.

这篇关于使用R中的向量化提高循环效率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆