Optimizing Apply() In R

Problem description

The goal of the code below is to perform recursive and iterative analysis on a data set that has 400 columns and 6000 rows. It takes two columns at a time and performs the analysis on them, then moves on until all possible combinations are covered.

Small subset of the large data set being used:

  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.71397
-0.710585   -0.714491   -0.710223   -0.713885
-0.710414   -0.714092   -0.710166   -0.71434
-0.711255   -0.714116   -0.70945    -0.714173
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644

Code using apply():

# Function
analysisFunc <- function () {

    # Fetch next data to be compared
    nextColumn <<- currentColumn + 1

    while (nextColumn <= ncol(Data)){

        # Fetch the two columns on which to perform analysis
        c1 <- Data[, currentColumn]
        c2 <- Data[, nextColumn]

        # Create linear model
        linearModel <- lm(c1 ~ c2)

        # Capture model data from summary
        modelData <- summary(linearModel)

        # Residuals
        residualData <- t(t(modelData$residuals))

        # Keep on appending data
        linearData <<- cbind(linearData, residualData)

        # Fetch next column
        nextColumn <<- nextColumn + 1

    }

    # Increment the counter
    currentColumn <<- currentColumn + 1

}

# Apply on function
apply(Data, 2, function(x) analysisFunc ())

I thought that, instead of using loops, apply() would help me optimize the code. However, it seems to have no major effect. Run time is more than two hours.

Does anyone think I am going wrong in how apply() has been used? Is having while() within the apply() call not a good idea? Is there any other way I can improve this code?

This is the first time I am working with functional programming. Please let me know your suggestions, thanks.

Solution

Consider building an expand.grid of column names and then using mapply, the multiple-input version of the apply family, where you pass two or more vectors/lists and run a function across the inputs elementwise. With this approach you avoid growing an object inside a loop and running an inner while loop:

Data

Data <- read.table(text="  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.71397
-0.710585   -0.714491   -0.710223   -0.713885
-0.710414   -0.714092   -0.710166   -0.71434
-0.711255   -0.714116   -0.70945    -0.714173
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644", header=TRUE)

Process

# Data frame of all combinations excluding same columns 
modelcols <- subset(expand.grid(c1=names(Data), c2=names(Data), 
                    stringsAsFactors = FALSE), c1!=c2)

# Function
analysisFunc <- function(x,y) {        
      # Fetch the two columns on which to perform analysis
      c1 <- Data[[x]]
      c2 <- Data[[y]]

      # Create linear model
      linearModel <- lm(c1 ~ c2)

      # Capture model data from summary
      modelData <- summary(linearModel)

      # Residuals
      residualData <- modelData$residuals
}

# Apply function to return matrix of residuals
linearData <- mapply(analysisFunc, modelcols$c1, modelcols$c2)
# re-naming matrix columns
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep="_")
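
As a side illustration (not part of the original answer), the toy sketch below shows how mapply pairs up the two name vectors: it calls the function once per row of the expand.grid result, taking the i-th element of each input together. The column names `a`, `b`, `c` and the variables `cols`, `pairs`, and `labels` are made up for this example.

```r
# Toy column names standing in for names(Data)
cols <- c("a", "b", "c")

# All ordered pairs of distinct names; expand.grid varies c1 fastest
pairs <- subset(expand.grid(c1 = cols, c2 = cols, stringsAsFactors = FALSE),
                c1 != c2)

# mapply calls the function as f(pairs$c1[1], pairs$c2[1]), f(pairs$c1[2], pairs$c2[2]), ...
labels <- mapply(function(x, y) paste(x, y, sep = "_"),
                 pairs$c1, pairs$c2, USE.NAMES = FALSE)
print(labels)
```

The order of `labels` follows the row order of `pairs`, which is why the residual matrix columns in the answer start with data2_data1, data3_data1, data4_data1.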

Output

    data2_data1   data3_data1   data4_data1   data1_data2   data3_data2   data4_data2
1  1.440828e-04  8.629813e-05  1.514109e-04  5.583917e-04 -0.0001205821  2.866488e-04
2 -6.949384e-04 -2.508770e-04 -2.487813e-04 -1.005367e-04 -0.0001263202 -2.145225e-04
3  2.132192e-04 -4.609125e-04  4.551430e-04 -8.715424e-05 -0.0004593840  4.133856e-04
4  3.692403e-04  2.182627e-04 -1.116648e-04  3.835538e-04  0.0000408864 -4.244855e-05
5 -2.025772e-04 -4.032600e-04  5.442655e-05 -8.423568e-05 -0.0003484501  4.986815e-05
6  2.336373e-04 -2.838073e-04 -4.425935e-04  1.967203e-04 -0.0003805576 -4.109706e-04
7  2.661145e-05  1.250425e-04 -6.893342e-05 -6.508936e-04  0.0003408023 -2.436194e-04
8  1.456357e-04  3.991303e-04 -2.496687e-05 -3.501856e-04  0.0004980726 -1.304535e-04
9 -2.349110e-04  5.701233e-04  2.359596e-04  1.343401e-04  0.0005555326  2.921120e-04
    data1_data3   data2_data3   data4_data3   data1_data4   data2_data4   data3_data4
1  5.121547e-04  4.313395e-05  2.829814e-04  4.232081e-04  1.795365e-05 -9.584175e-05
2 -1.649379e-06 -6.684696e-04 -2.349827e-04  1.975728e-04 -7.112598e-04 -3.014160e-04
3 -2.942277e-04  3.141257e-04  4.029018e-04 -3.420290e-04  2.382149e-04 -3.760631e-04
4  3.371847e-04  2.859362e-04 -3.420612e-05  3.168009e-04  3.048006e-04  1.062117e-04
5 -1.651011e-04 -1.308671e-04  3.332034e-05 -5.127719e-05 -1.969902e-04 -3.890484e-04
6  2.550032e-05  2.586674e-04 -4.196917e-04  3.235528e-04  2.115955e-04 -3.627735e-04
7 -5.692790e-04  1.157675e-04 -2.277195e-04 -5.922595e-04  1.840773e-04  3.645036e-04
8 -2.258187e-04  1.445371e-04 -1.077903e-04 -3.583290e-04  2.386756e-04  5.422018e-04
9  3.812360e-04 -3.628313e-04  3.051868e-04  8.276013e-05 -2.870674e-04  5.122258e-04
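
A possible variation on the answer above, offered only as a sketch: since just the residuals are kept, residuals() can be called on the fitted model directly, skipping summary(), and vapply can be used so that every result is checked to be a numeric vector with one value per data row. The helper name `pairResiduals` and the random toy `Data` are assumptions made for illustration; whether this is measurably faster on the real 6000 x 400 data set has not been tested here.

```r
# Small made-up data frame standing in for the real data set
set.seed(1)
Data <- data.frame(data1 = rnorm(9), data2 = rnorm(9), data3 = rnorm(9))

# All ordered pairs of distinct columns, built as in the answer above
modelcols <- subset(expand.grid(c1 = names(Data), c2 = names(Data),
                                stringsAsFactors = FALSE), c1 != c2)

# Hypothetical helper: residuals of lm(Data[[x]] ~ Data[[y]]) without summary()
pairResiduals <- function(x, y) {
  unname(residuals(lm(Data[[x]] ~ Data[[y]])))
}

# vapply over row indices; FUN.VALUE requires one numeric value per data row
linearData <- vapply(seq_len(nrow(modelcols)),
                     function(i) pairResiduals(modelcols$c1[i], modelcols$c2[i]),
                     FUN.VALUE = numeric(nrow(Data)))
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep = "_")

dim(linearData)
```

Size is worth keeping in mind for the full data set: 400 columns give 400 x 399 = 159,600 ordered pairs, and with 6000 residuals each the result holds about 9.6e8 doubles, roughly 7.7 GB. The loop in the question only visits pairs where the second column comes after the first, which is 79,800 pairs; if that is all that is needed, combn(names(Data), 2) would produce those pairs instead of expand.grid. Note that lm(c1 ~ c2) and lm(c2 ~ c1) give different residuals, so which set of pairs is right depends on the analysis.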
