R performance with data reshaping
Question
I am trying to reshape a data frame in R, but I am running into problems with the recommended ways of doing so. The data frame has the following structure:
ID DATE1 DATE2 VALTYPE VALUE
'abcd1233' 2009-11-12 2009-12-23 'TYPE1' 123.45
...
VALTYPE is a string stored as a factor with only 2 values (say TYPE1 and TYPE2). I need to transform it into the following data frame (a "wide" transpose) based on common ID and DATEs:
ID DATE1 DATE2 VALUE.TYPE1 VALUE.TYPE2
'abcd1233' 2009-11-12 2009-12-23 123.45 NA
...
The data frame has more than 4,500,000 observations (although about 70% of the VALUEs are NA). The machine is an Intel-based Linux workstation with 4 GB of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250 MB, which clearly leaves a lot of space for reshaping.
These are my experiences so far:
Using the vanilla reshape() method:
tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE");
RESULT: Error: cannot allocate vector of size 4.8 Gb
Using the cast() method from the reshape package:
tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE);
RESULT: R process consumes all RAM with no end in sight. Had to kill the process eventually.
Using by() and merge():
sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x); tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE);
RESULT: works fine, although it is neither elegant nor foolproof (i.e. it will break if more types are added).
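One way the by()/merge() idea might be made independent of the number of types is to split on VALTYPE and fold the pieces together with Reduce(). This is only a minimal sketch on made-up toy data. The names id_cols, pieces and wide are illustrative, and nothing here has been measured against the 4.5-million-row table, so it should not be read as a fix for the memory problem.

```r
# Toy data in the same long layout as the question (values are made up).
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233", "efgh5678"),
  DATE1   = as.Date(c("2009-11-12", "2009-11-12", "2009-11-13")),
  DATE2   = as.Date(c("2009-12-23", "2009-12-23", "2009-12-24")),
  VALTYPE = factor(c("TYPE1", "TYPE2", "TYPE1")),
  VALUE   = c(123.45, 67.8, 9.1),
  stringsAsFactors = FALSE
)

# Key columns shared by every piece (illustrative name).
id_cols <- c("ID", "DATE1", "DATE2")

# One (ID, DATE1, DATE2, VALUE) data frame per level of VALTYPE.
pieces <- split(tbl[c(id_cols, "VALUE")], tbl$VALTYPE)

# Rename VALUE to VALUE.<type> in each piece so merged columns do not clash.
pieces <- Map(function(piece, type) {
  names(piece)[names(piece) == "VALUE"] <- paste("VALUE", type, sep = ".")
  piece
}, pieces, names(pieces))

# Fold the pieces together with full outer joins on the key columns.
wide <- Reduce(function(x, y) merge(x, y, by = id_cols, all = TRUE), pieces)
print(wide)
```

Because the levels of VALTYPE drive the split, a third type would simply add another VALUE.<type> column. Each merge() still copies its inputs, so peak memory use grows with the number of types.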
To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?
Recommended answer
Maybe you could use the cat() function?