R - Why does adding 1 column to a data.table nearly double the peak memory used?


Problem Description

After getting help from 2 kind gentlemen, I managed to switch over to data.table from data.frame + plyr.

The Situation and My Questions

As I worked on it, I noticed that peak memory usage nearly doubled from 3.5GB to 6.8GB (according to Windows Task Manager) when I added 1 new column using := to my data set of ~200K rows by 2.5K columns.

I then tried 200M rows by 25 cols; the increase was from 6GB to 7.6GB before dropping to 7.25GB after a gc().

Specifically regarding adding of new columns, Matt Dowle himself mentioned here that:

With its := operator you can :

Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.
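
(A minimal illustration of these by-reference operations: address() is exported by data.table and reports where the object lives in memory, so an unchanged address after := shows no copy of the table was made.)

library(data.table)
small <- data.table(x = 1:5)
address(small)          # memory address before

small[, y := x * 2]     # add a column by reference
small[x > 3, y := 0L]   # modify a subset by reference
small[, y := NULL]      # delete a column by reference

address(small)          # same address: the table itself was never copied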

Question 1: Why would adding a single column of NAs to a DT with 2.5K columns double the peak memory used if the data.table is not copied at all?
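
(A back-of-envelope size check, using the column mix from the Test Code below, shows why the doubling looks like exactly one full copy of the table:)

1896 * 200000 * 8 / 2^30   # 1896 numeric cols x 8 bytes   ~ 2.83 GiB
 600 * 200000 * 4 / 2^30   # 600 integer cols x 4 bytes    ~ 0.45 GiB
   4 * 200000 * 8 / 2^30   # 4 char/date cols of pointers  ~ 0.006 GiB
# Total ~3.3 GiB -- close to the 3.5GB baseline observed, so a jump of
# +3.3GB is about the cost of one full copy of the table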

Question 2: Why does the doubling not occur when the DT is 200M x 25? I didn't include the screenshot for this, but feel free to change my code and try.

Screenshots of Memory Usage Using the Test Code

  1. Clean re-boot, RStudio & MS Word opened - 103MB used

  2. After running the DT creation code but before adding the column - 3.5GB used

  3. After adding 1 column filled with NA, but before gc() - 6.8GB used

  4. After running gc() - 3.5GB used

Test Code

To investigate, I put together the following test code that closely mimics my data set:

library(data.table)
set.seed(1)

# Credit: Dirk Eddelbuettel's answer in 
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") { 
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))   # time span in seconds
  ev <- runif(N, 0, dt)                                # N random offsets into the span
  as.character(strptime(st + ev, "%Y-%m-%d"))          # return dates as "YYYY-MM-DD" strings
}

# Create Sample data
TotalNoCol <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol <- 600
TotalNumCol <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow <- 200000

ColNames <- paste0("C", 1:TotalNoCol)

# Build the columns as one list, name them, then convert to a data.table
dt <- as.data.table( setNames( c(
  replicate( TotalCharCol, sample( state.name, nrow, replace = TRUE ), simplify = FALSE ), 
  replicate( TotalDateCol, RandDate( nrow ), simplify = FALSE ), 
  replicate( TotalNumCol, round( runif( nrow, 1, 30 ), 2 ), simplify = FALSE ), 
  replicate( TotalIntCol, sample( 1:10, nrow, replace = TRUE ), simplify = FALSE ) ), 
  ColNames ) )

gc()

# Add New columns, to be run separately
dt[, New_Col := NA ]  # Additional col; uses excessive memory?
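
# (Aside: NA is logical in R, so New_Col above becomes a logical column.
# Typed NA columns, if ever needed, look like this; the names are illustrative:)
dt[, New_Col_num := NA_real_ ]     # numeric NA column (illustrative name)
dt[, New_Col_int := NA_integer_ ]  # integer NA column (illustrative name)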

Research Done

I didn't find much discussion of memory usage for DTs with many columns; only this, and even then it's not specifically about memory.

Most discussions of large datasets + memory usage involve DTs with very large row counts but relatively few columns.

My System

Intel i7-4700 with 4 cores/8 threads; 16GB DDR3-12800 RAM; Windows 8.1 64-bit; 500GB 7200rpm HDD; 64-bit R; data.table ver 1.9.4

Disclaimers

Please pardon me for using a 'non-R' method (i.e. Task Manager) to measure memory used. Memory measurement/profiling in R is something I still haven't figured out.
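
For what it's worth, a minimal R-native alternative using base R's gc() high-water marks (its "max used" columns) and object.size() — a sketch, not a full profiler:

gc(reset = TRUE)                      # reset the "max used" high-water marks
dt[, New_Col := NA ]                  # the operation under test
gc()                                  # "max used" column = peak since the reset
print(object.size(dt), units = "Gb")  # size of dt itself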


Edit 1: After updating to data.table ver 1.9.5 and re-running, the issue persisted, unfortunately.

Solution

(I can take no credit as the great DT minds (Arun) have been working on this and found it was related to print.data.table. Just closing the loop here for other SO users.)

It seems this data.table memory issue with := has been solved in R version 3.2, as noted in https://github.com/Rdatatable/data.table/issues/1062

[Quoting @Arun from Github issue 1062...]

fixed in R v3.2, IIUC, with this item from NEWS:

Auto-printing no longer duplicates objects when printing is dispatched to a method.

So others with this problem should look to upgrading to R 3.2.
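
A quick way to confirm your session has the fix:

getRversion()                  # should report 3.2.0 or later
packageVersion("data.table")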
