Slow data.frame row assignment


Problem description

I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The result is quite long, about 2 million documents (rows).

While doing performance tests, I found that the time for writing values into a row increases with the size of the data.frame. Maybe it is a well-known issue and I am the last one to notice it.

Some code example:

set.seed(20140430)
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <-  c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))

nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <-  c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))

On my machine, a single row assignment into the 2-million-row data.frame takes about 0.4 seconds. That is a lot of time if I want to fill the whole dataset. Here is a second simulation to illustrate the issue.

nreg <- seq(2e1,2e7,length.out=10)
te <- NULL 
for(i in nreg){
    dfres <- as.data.frame(matrix(rep(NA,i*7),nrow=i,ncol=7))
    te <- c(te,mean(replicate(10,{r <- sample(1:i,1); system.time(dfres[r,] <- c(1:5,"a","b"))[3]}) ) )
}
plot(nreg,te,xlab="Number of rows",ylab="Avg. time for 10 random assignments [sec]",type="o")
#rm(nreg,dfres,te)

Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?

Solution

Let's start with "columns" first and see what goes on and then return to rows.

R versions < 3.1.0 (unnecessarily) copy the entire data.frame when you operate on it. For example:

## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available

# Changed variables:
#           old            new           
# x         0x7ff9343fb4d0 0x7ff9326dfba8
# y         0x7ff9343fb488 0x7ff9326dfbf0
# z         <added>        0x7ff9326dfc38

# Changed attributes:
#           old            new           
# names     0x7ff934170c28 0x7ff934308808
# row.names 0x7ff934551b18 0x7ff934308970
# class     0x7ff9346c5278 0x7ff935d1d1f8

You can see that adding the "new" column has resulted in a copy of the "old" columns (the addresses are different). The attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.

Shallow copies only copy the vector of column pointers, not the entire data, whereas deep copies copy everything (which is unnecessary here).
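
If you want to observe this yourself without dplyr, base R's tracemem() prints a message each time a traced object is duplicated (it needs an R build with memory profiling enabled, which the standard CRAN binaries have). A minimal sketch; how much duplication gets reported depends on your R version:

## Minimal sketch: watching data.frame copies with base R's tracemem()
df <- data.frame(x = 1:5, y = 6:10)
tracemem(df)      # report every duplication of `df` from now on
df$z <- 11:15     # add a column; any copy of `df` is printed
df$y[1L] <- -6L   # modify a value; again, any copy is printed
untracemem(df)    # stop tracing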

However, in R v3.1.0 there has been a nice, welcome change in that the "old" columns are no longer deep copied. All credit to the R core dev team.

## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available

# Changed variables:
#           old     new           
# z         <added> 0x7f85d328dda8

# Changed attributes:
#           old            new           
# names     0x7f85d1459548 0x7f85d297bec8
# row.names 0x7f85d2c66cd8 0x7f85d2bfa928
# class     0x7f85d345cab8 0x7f85d2d6afb8

You can see that columns x and y aren't changed at all (and are therefore not present in the output of the changes function call). This is a huge (and welcome) improvement!

So far, we have looked at the issue of adding columns in R < 3.1.0 and in v3.1.0.


Now, coming to your question: what about the "rows"? Let's consider the older version of R first and then come back to R v3.1.0.

## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)

# Changed variables:
#           old            new           
# x         0x7f968b423e50 0x7f968ac6ba40
# y         0x7f968b423e98 0x7f968ac6bad0
# 
# Changed attributes:
#           old            new           
# names     0x7f968ab88a28 0x7f968abca8e0
# row.names 0x7f968abb6438 0x7f968ab22bb0
# class     0x7f968ad73e08 0x7f968b580828

Once again we see that, in the older version of R, changing column y has resulted in column x being copied as well.

## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)

# Changed variables:
#           old            new           
# y         0x7f85d3544090 0x7f85d2c9bbb8
# 
# Changed attributes:
#           old            new           
# row.names 0x7f85d35a69a8 0x7f85d35a6690

We see the nice improvement in R v3.1.0, which results in a copy of just column y. Once again, a great improvement in R v3.1.0! R's copy-on-modify has gotten wiser.

But still, using data.table's assignment-by-reference semantics, we can go one step further and not copy even the y column, as still happens in R v3.1.0.

The idea is: as long as the type of the value you assign to a column at certain indices doesn't change (here, column y is integer, so as long as you assign an integer back to y), we can do it without any copy at all, by modifying in place (by reference).

Why? Because we don't have to allocate or re-allocate anything here. As an example, if you had assigned a double/numeric value, which requires 8 bytes of storage as opposed to the 4 bytes of the integer column y, then we would have to create a new column y and copy the values over.
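
As a small illustration of that claim (not code from the original answer), data.table's set() together with its address() helper shows that assigning an integer back into an integer column leaves the column's memory address unchanged, i.e. nothing was copied:

library(data.table)
dt <- data.table(x = 1:5, y = 6:10)
address(dt$y)                          # address of the integer column y
set(dt, i = 1L, j = "y", value = -6L)  # integer into an integer column
address(dt$y)                          # same address: modified in place, no copy
# Assigning a value of a different type (e.g. -6.5) could not be done in place,
# since the column would have to be re-allocated with a wider storage type.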

That is, we can sub-assign by reference using data.table. We can use either := or set() to do this. I'll demonstrate using set() here.
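
The actual benchmark code was linked externally, so what follows is only a hedged sketch of how a set()-based fill of the question's data might look. The seven columns and the value vector c(1:5, "a", "b") mirror the question; the object name dtres and the pre-allocation choices are mine:

library(data.table)

nreg <- 2e6
## Pre-allocate with the final column types (five integer, two character),
## so that set() never needs to change a column's type later on.
dtres <- data.table(V1 = integer(nreg), V2 = integer(nreg), V3 = integer(nreg),
                    V4 = integer(nreg), V5 = integer(nreg),
                    V6 = character(nreg), V7 = character(nreg))

vals <- list(1L, 2L, 3L, 4L, 5L, "a", "b")
system.time({
  i <- sample.int(nreg, 1L)                      # a random row, as in the question
  for (j in seq_along(vals))
    set(dtres, i = i, j = j, value = vals[[j]])  # sub-assign by reference
})

Each set() call writes a single cell in place, so the cost of filling a row no longer grows with the number of rows in the table.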

Now, here's a comparison of base R and data.table on your data, from 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.

Plot for comparison against R v3.0.3:

Plot for comparison against R v3.1.0:

The min, median and max for R v3.0.3, R v3.1.0 and data.table on 20 million rows with 10 replications are:

      type    min  median    max
base_3.0.3  10.05   10.70  18.51
base_3.1.0   1.67    1.97   5.20
data.table   0.04    0.04   0.05

Note: You can see the complete timings in this gist.

This clearly shows the improvement in R v3.1.0, but it also shows that the column being changed is still copied, and that copy still consumes some time; this is overcome by sub-assignment by reference in data.table.

HTH
