Assignment to big R data frame


Problem description

In R, I create a data frame in the following way:

data <- data.frame(dummy=rep('dummy',10000))
data$number = 0
data$another = 1

When I run a for loop that assigns values to the data frame (iterating through rows), my code runs extremely slowly:

calculation <- function() {2}
somethingElse <- function() {3}

system.time(
 for (i in 1:10000) {
   data[i,2]=calculation()
   data[i,3]=somethingElse()
 }
)

The above snippet runs in 20 seconds on my laptop. In other languages like C or Java, this finishes instantly. Why is it so slow in R? I remember reading that R stores matrices column by column (unlike C, for example, where it's row by row). But still, I'm puzzled about why it takes so much time. Shouldn't my data.frame fit comfortably in memory (eluding slow disk write behavior)?

As a follow-up to my question, I'd like to ask for a quick way to fill my data frame by row, if one exists.

EDIT: Please note that I'm not trying to assign the constants 2 and 3 to my data frame. In the actual problem I was trying to solve, calculation() and somethingElse() are a bit more complicated and depend on another data frame. My question is about efficient insertion into a data frame in a loop (and I'm also curious about why this is so slow).

Solution

The answer is vectorization:

data[,2] = 2
data[,3] = 3

finishes instantly for me. For loops in interpreted languages like R are veeeeery slow. Performing this kind of operation by assigning a vector directly (i.e. vectorized) is much, much faster.
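
When the per-row values are not constants, the same idea can still apply: do the element-by-element work on plain vectors and assign each column to the data frame only once. A likely reason the original loop is slow is that every `data[i, j] = value` goes through the data frame assignment method, which carries far more overhead than assigning into a plain vector. The sketch below is only an illustration under assumptions: `calculation(i)` and `somethingElse(i)` are hypothetical stand-ins that take a row index, since the real functions are not shown in the question.

```r
# Hypothetical stand-ins for the asker's per-row computations
# (assumed here to take the row index as their only argument)
calculation <- function(i) i * 2
somethingElse <- function(i) i + 3

n <- 10000
data <- data.frame(dummy = rep("dummy", n))

# Preallocate plain numeric vectors; assigning into these is cheap
number <- numeric(n)
another <- numeric(n)
for (i in seq_len(n)) {
  number[i] <- calculation(i)
  another[i] <- somethingElse(i)
}

# Assign each finished vector to the data frame in a single step
data$number <- number
data$another <- another

# The same column built without an explicit loop
data$number2 <- vapply(seq_len(n), calculation, numeric(1))
```

Wrapping the loop in system.time(), as in the question, is a simple way to check whether this helps on a given machine.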

Programming in a new language requires a new mindset. Your approach reflects habits from a compiled language; there is no need for the for loop here.
