How to vectorize a for loop in R


Problem description



I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.

datanew <- c()
for (i in 1:51){
  for (k in 1:6){
    for (m in 1:4){

      sub <- subset(data,data$var1==i & data$var2==k)

      sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]

      sub$newvar <- m

      datanew <- rbind(datanew,sub)

    }
  }
}

Please let me know what you think and thanks for the help.

Below is some sample data with 2K observations instead of 200K

# SAMPLE DATA
#------------------------------------------------#
  mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
  var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
  var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
  #----------------------------------#
  mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
  filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#

Solution

You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply. Also, we are creating two vectors that we will combine for vectorized multiplication.
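The vectorized multiplication relies on R recycling a vector down the columns of a matrix. Here is a minimal sketch of that idea on a made-up data frame; `toy` and `col_mult` are illustrative names and values, not part of the question's data.

```r
# Toy data frame: 3 rows, 4 columns (made-up values for illustration)
toy <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(10, 20, 30), flag = c(0, 1, 0))

# One multiplier per column: 1 leaves a column unchanged, 2 doubles it
col_mult <- c(1, 2, 2, 1)

# t(toy) has one row per original column, so multiplying by col_mult
# (recycled down each column of the transposed matrix) scales every original
# column by its own multiplier; the outer t() restores the row layout
scaled <- t(t(toy) * col_mult)
print(scaled)
```

Note that the result is a numeric matrix rather than a data frame, so wrap it in `as.data.frame()` if a data frame is needed afterwards.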

# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)

    # Take a look at what expand.grid does
    head(ixk, 60)


# create two vectors for multiplying against our dataframe subset;
# columns 4 through ncol(mydf)-1 get scaled, matching the loop in the question
multpVec <- c(rep(c(0, 1), times=c(3, ncol(mydf)-3-1)), 0)
invVec   <- !multpVec

    # example of how we will use the vectors
    (multpVec * filingstat0711[1, 2, 1] + invVec)


# Instead of for loops, we can use mapply. 
newdf <- 
  mapply(function(i, k) 

    # The function that you are `mapply`ing is:
    # rbind'ing a list of data frames, each subsetted by matching var1 & var2
    # and then multiplied by a value in filingstat0711
    do.call(rbind, 
        # iterating over m
        lapply(1:4, function(m)

          # the cbind is for adding the newvar=m, at the end of the subtable
          cbind(

            # we transpose twice: first the subset, so that it can be multiplied by our vector,
            # then the result, to get back the original layout
            t( t(subset(mydf, var1==i & mydf$var2==k)) * 
              (multpVec * filingstat0711[i,k,m] + invVec)), 

          # this is an argument to cbind
          "newvar"=m) 
    )), 

    # the two lists you are passing as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )

# flatten the data frame
newdf <- do.call(rbind, newdf)
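
A quick way to gain confidence in this kind of rewrite is to compare it against plain loops on a small toy data set. The sketch below does that. `toy_df`, `toy_scale`, `value_cols` and the sizes are made up for illustration and are not taken from the question's data.

```r
set.seed(1)

# Hypothetical toy data: 2 levels of var1, 2 levels of var2, 3 value columns
toy_df <- data.frame(var1 = rep(1:2, each = 4),
                     var2 = rep(1:2, times = 4),
                     x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8))

# One scalar per (var1, var2, m) combination, with m in 1:3
toy_scale <- array(seq_len(2 * 2 * 3), dim = c(2, 2, 3))
value_cols <- 3:5  # columns to scale in this toy example

# Reference result: explicit loops, collecting the pieces in a list
pieces <- list()
for (i in 1:2) for (k in 1:2) for (m in 1:3) {
  sub_df <- toy_df[toy_df$var1 == i & toy_df$var2 == k, ]
  sub_df[, value_cols] <- toy_scale[i, k, m] * sub_df[, value_cols]
  sub_df$newvar <- m
  pieces[[length(pieces) + 1]] <- sub_df
}
loop_result <- do.call(rbind, pieces)

# Same computation with mapply over (i, k) and lapply over m.
# k is listed first in expand.grid so that it varies fastest,
# which reproduces the row order of the nested loops above.
mult <- c(0, 0, 1, 1, 1)  # 1 marks a column to scale
grid <- expand.grid(k = 1:2, i = 1:2)
vec_result <- do.call(rbind, mapply(function(i, k) {
  sub_mat <- as.matrix(toy_df[toy_df$var1 == i & toy_df$var2 == k, ])
  do.call(rbind, lapply(1:3, function(m)
    cbind(t(t(sub_mat) * (mult * toy_scale[i, k, m] + (1 - mult))), newvar = m)))
}, grid$i, grid$k, SIMPLIFY = FALSE))

# Compare values only, ignoring row and column names
same <- isTRUE(all.equal(unname(as.matrix(loop_result)), unname(vec_result)))
print(same)
```

The order of the factors in `expand.grid` matters only if the row order of the combined result has to match the loop version; the set of rows is the same either way.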



Two points to note:

(1) Try not to use names like data, table, df, sub etc., which are commonly used functions, for your own objects. In the above code I used mydf in place of data.

(2) You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation.
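
To illustrate point (2), here is a minimal sketch on a small grid. The function `f` is a made-up stand-in for the real per-combination work, and `grid` is a small illustrative grid rather than the full `ixk`.

```r
grid <- expand.grid(i = 1:3, k = 1:2)
f <- function(i, k) i * 10 + k  # hypothetical stand-in for the real work

# mapply: each column is passed as its own argument
res_mapply <- mapply(f, grid$i, grid$k)

# apply over rows: each row arrives as a single named vector, so the
# function has to pull the indices out by name (or position)
res_apply <- apply(grid, 1, function(row) f(row[["i"]], row[["k"]]))

print(res_mapply)
print(res_apply)
```

Both calls give the same values here. One thing to keep in mind is that `apply` first converts the data frame to a matrix, so a grid with mixed column types would be coerced to a single type, whereas `mapply` passes each column through unchanged.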

Good luck, and welcome to SO
