In R, how do you loop over the rows of a data frame really fast?

Question

Suppose that you have a data frame with many rows and many columns.

The columns have names. You want to access rows by number and columns by name.

For example, one (possibly slow) way to loop over the rows is:

for (i in seq_len(nrow(df))) {  # seq_len() also handles a data frame with zero rows
  print(df[i, "column1"])
  # do more things with the data frame...
}

Another way is to create "lists" for separate columns (like column1_list = df[["column1"]]) and access those lists in one loop. This approach might be fast, but it is also inconvenient if you want to access many columns.
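The column-extraction idea can be sketched for several columns at once. This is only an illustration under assumptions: the data frame contents and the names `cols` and `out` are made up here and are not part of the question.

```r
# Hypothetical example data; the column names are invented for illustration.
df <- data.frame(column1 = c(1.5, 2.5, 3.5),
                 column2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Extract every column once: as.list() on a data frame gives a
# named list of column vectors.
cols <- as.list(df)

out <- character(nrow(df))  # preallocate the result
for (i in seq_len(nrow(df))) {
  # Rows by number, columns by name, without indexing the data frame itself.
  out[i] <- paste(cols[["column2"]][i], cols[["column1"]][i])
}
print(out)
```

Keeping the columns in one named list means only one extra object has to be managed, however many columns the loop touches.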

Is there a fast way of looping over the rows of a data frame? Is some other data structure better for looping fast?

Recommended answer

I think I need to make this a full answer because I find comments harder to track and I already lost one comment on this... There is an example by nullglob that demonstrates the differences among for and the apply family functions much better than other examples. When the function being called is very slow, that is where all the time is consumed, and you won't find differences among the variations on looping. But when you make the function trivial, you can see how much the looping itself influences things.

I'd also like to add that some members of the apply family unexplored in other examples have interesting performance properties. First I'll show replications of nullglob's relative results on my machine.

n <- 1e6
sinI <- numeric(n)  # sinI must exist before the loop assigns into it
system.time(for(i in 1:n) sinI[i] <- sin(i))
  user  system elapsed 
 5.721   0.028   5.712 

# lapply runs much faster for the same result
system.time(sinI <- lapply(1:n,sin))
   user  system elapsed 
  1.353   0.012   1.361 
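To repeat this comparison on another machine, a minimal sketch along the following lines may help. It assumes a smaller `n` than the 1e6 used above so that it runs quickly, and the variable names (`sin_for`, `t_for`, and so on) are illustrative only. No timings are shown because they depend on the machine and R version.

```r
n <- 1e4  # assumed smaller size; the timings above used 1e6

sin_for <- numeric(n)  # preallocate so the loop does not grow the vector
t_for <- system.time(for (i in seq_len(n)) sin_for[i] <- sin(i))

t_lapply <- system.time(sin_lapply <- lapply(seq_len(n), sin))  # returns a list
t_sapply <- system.time(sin_sapply <- sapply(seq_len(n), sin))  # simplifies to a vector

# Elapsed seconds for each variant.
print(c(for_loop = t_for[["elapsed"]],
        lapply   = t_lapply[["elapsed"]],
        sapply   = t_sapply[["elapsed"]]))
```

Note that lapply returns a list, so its result needs unlist() before it can be compared directly with the numeric vector produced by the for loop.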

He also found sapply much slower. Here are some others that weren't tested.

Plain old apply to a matrix version of the data...

mat <- matrix(1:n, ncol = 1)
system.time(sinI <- apply(mat,1,sin))
   user  system elapsed 
  8.478   0.116   8.531 

So, the apply() command itself is substantially slower than the for loop. (The for loop is not slowed down appreciably if I use sin(mat[i,1]).)
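The two matrix variants mentioned here can be set side by side as follows. This is a sketch only: it assumes a smaller `n` than in the timings above, and the names `sin_for`, `sin_apply`, `t_for` and `t_apply` are illustrative.

```r
n <- 1e4  # assumed smaller size; the timings above used 1e6
mat <- matrix(1:n, ncol = 1)

# for loop that indexes the matrix by row and column number
sin_for <- numeric(n)
t_for <- system.time(for (i in seq_len(n)) sin_for[i] <- sin(mat[i, 1]))

# apply() over the rows of the same matrix
t_apply <- system.time(sin_apply <- apply(mat, 1, sin))

# Elapsed seconds; values depend on machine and R version.
print(c(for_loop = t_for[["elapsed"]], apply = t_apply[["elapsed"]]))
```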

Another one that doesn't seem to be tested in other posts is tapply.

system.time(sinI <- tapply(1:n, 1:n, sin))
   user  system elapsed 
 12.908   0.266  13.589 

Of course, one would never use tapply this way, and its utility goes far beyond any such speed problem in most cases.
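For contrast, a more typical use of tapply is a grouped summary, where one function call handles every element of a group. The data below are made up for illustration.

```r
# Hypothetical values and group labels.
values <- c(1, 2, 3, 4, 5, 6)
group  <- c("a", "b", "a", "b", "a", "b")

# Mean of `values` within each level of `group`.
means <- tapply(values, group, mean)
print(means)
```

Here the function is called once per group rather than once per element, which is quite different from the one-group-per-element benchmark above.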
