R performance with data reshaping


Question

I am trying to reshape a data frame in R, but the recommended ways of doing so all seem to run into problems. The data frame has the following structure:

ID                     DATE1             DATE2            VALTYPE        VALUE
'abcd1233'         2009-11-12        2009-12-23           'TYPE1'        123.45
...

VALTYPE is a string stored as a factor with only 2 levels (say TYPE1 and TYPE2). I need to transform it into the following data frame (a "wide" transpose) keyed on the common ID and DATEs:

ID                     DATE1             DATE2            VALUE.TYPE1  VALUE.TYPE2
'abcd1233'             2009-11-12        2009-12-23       123.45           NA
...

The data frame has more than 4,500,000 observations (although about 70% of the VALUEs are NA). The machine is an Intel-based Linux workstation with 4 GB of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250 MB, which should leave plenty of room for reshaping.

These are my experiences so far:

  • Using the vanilla reshape() method:

tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE")

RESULT: Error: cannot allocate vector of size 4.8 Gb

  • Using the cast() method from the reshape package:

tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE)

RESULT: the R process consumes all RAM with no end in sight. I had to kill the process eventually.

  • Using by() and merge():

sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x)
tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE)

RESULT: works fine, although it is neither elegant nor foolproof (i.e. it will break if more types are added).

To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?
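For comparison, the hash-style lookup that makes the AWK or Perl version cheap can be approximated in base R with match() and matrix indexing. The following is only a sketch on made-up data, not something measured against the 4.5-million-row table. The toy `tbl`, the `keys` vector, the `make_key` helper, and the `"\r"` separator are assumptions introduced for illustration. It also presumes at most one VALUE per ID/DATE1/DATE2/VALTYPE combination (if there are several, the last one wins).

```r
# Made-up rows in the long layout described in the question.
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233", "efgh5678"),
  DATE1   = as.Date(c("2009-11-12", "2009-11-12", "2009-10-01")),
  DATE2   = as.Date(c("2009-12-23", "2009-12-23", "2009-10-15")),
  VALTYPE = factor(c("TYPE1", "TYPE2", "TYPE1")),
  VALUE   = c(123.45, 67.8, 9.1),
  stringsAsFactors = FALSE
)
keys <- c("ID", "DATE1", "DATE2")

# One output row per distinct ID/DATE1/DATE2 combination.
out <- unique(tbl[keys])

# Hypothetical helper: collapse the key columns into one string per row
# so that rows of the long table can be matched to rows of the output.
make_key <- function(df) {
  do.call(paste, c(unname(as.list(df[keys])), sep = "\r"))
}
row_idx <- match(make_key(tbl), make_key(out))
col_idx <- as.integer(tbl$VALTYPE)

# Pre-allocate an NA matrix with one column per VALTYPE level,
# then write each VALUE into its (row, column) cell.
vals <- matrix(
  NA_real_,
  nrow = nrow(out),
  ncol = nlevels(tbl$VALTYPE),
  dimnames = list(NULL, paste("VALUE", levels(tbl$VALTYPE), sep = "."))
)
vals[cbind(row_idx, col_idx)] <- tbl$VALUE

wide <- cbind(out, vals)
```

Because the column index comes from the factor levels, the same code would handle a third or fourth VALTYPE without changes.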

Recommended answer

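Maybe you could generalize the by()/merge() approach that already works for you? Split the data frame by VALTYPE with split(), rename VALUE in each piece, and fold the pieces together with Reduce() and merge(), so that it no longer breaks when more types are added.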
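One way to generalize the by()/merge() approach from the question to any number of VALTYPE levels is sketched below on made-up data. The toy `tbl`, the `keys` vector, and the `pieces` list are illustrative names, not part of the original post, and the memory behaviour on the full data set has not been measured here.

```r
# Made-up rows in the long layout described in the question.
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233", "efgh5678"),
  DATE1   = as.Date(c("2009-11-12", "2009-11-12", "2009-10-01")),
  DATE2   = as.Date(c("2009-12-23", "2009-12-23", "2009-10-15")),
  VALTYPE = factor(c("TYPE1", "TYPE2", "TYPE1")),
  VALUE   = c(123.45, 67.8, 9.1),
  stringsAsFactors = FALSE
)
keys <- c("ID", "DATE1", "DATE2")

# One data frame per VALTYPE level, keeping only the keys and VALUE.
pieces <- split(tbl[c(keys, "VALUE")], tbl$VALTYPE)

# Rename VALUE to VALUE.<level> in each piece so the columns stay distinct.
pieces <- Map(function(piece, type) {
  names(piece)[names(piece) == "VALUE"] <- paste("VALUE", type, sep = ".")
  piece
}, pieces, names(pieces))

# Full outer join of all pieces on the key columns, two at a time.
wide <- Reduce(function(a, b) merge(a, b, by = keys, all = TRUE), pieces)
```

With two levels this performs the same single merge() as in the question; with more levels Reduce() simply chains additional merges.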
