Handling java.lang.OutOfMemoryError when writing to Excel from R

Question

The xlsx package can be used to read and write Excel spreadsheets from R. Unfortunately, even for moderately large spreadsheets, java.lang.OutOfMemoryError can occur. In particular,

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
java.lang.OutOfMemoryError: GC overhead limit exceeded

(Other related exceptions are also possible but rarer.)

A similar question was asked regarding this error when reading spreadsheets.

Importing a big xlsx file into R?

The main advantage of using Excel spreadsheets as a data storage medium over CSV is that you can store multiple sheets in the same file, so here we consider a list of data frames to be written one data frame per worksheet. This example dataset contains 40 data frames, each with two columns of up to 200k rows. It is designed to be big enough to be problematic, but you can change the size by altering n_sheets and n_rows.

library(xlsx)
set.seed(19790801)
n_sheets <- 40
the_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(2e5, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(the_data) <- paste("Sheet", seq_len(n_sheets))
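
Before writing anything, it can help to gauge how much data is being handed to Java. The following is a minimal sketch that rebuilds the same kind of list at a much smaller scale and summarises it; the names small_data, max_rows, rows_per_sheet and total_cells are illustrative and not part of the original example.

```r
# Smaller stand-in for the_data so the summary runs quickly.
set.seed(19790801)
n_sheets <- 3        # the question uses 40
max_rows <- 1000     # the question samples up to 2e5 rows per sheet
small_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(max_rows, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(small_data) <- paste("Sheet", seq_len(n_sheets))

# Rows per sheet and the total number of cells that would be written.
rows_per_sheet <- vapply(small_data, nrow, integer(1))
total_cells <- sum(rows_per_sheet) * 2   # two columns per data frame
print(rows_per_sheet)
print(total_cells)

# Size of the list as held in R's own memory.
print(format(object.size(small_data), units = "Kb"))
```

Running the same summary on the_data gives a rough idea of how many cells the workbook has to hold before it is saved.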

The natural method of writing this to file is to create a workbook using createWorkbook, then loop over each data frame calling createSheet and addDataFrame. Finally the workbook can be written to file using saveWorkbook. I've added messages to the loop to make it easier to see where it falls over.

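wb <- createWorkbook()
for(i in seq_along(the_data))
{
  # Progress messages show which step was running when the error occurred
  message("Creating sheet ", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame ", i)
  addDataFrame(the_data[[i]], sheet)
}
# The file is only written once, after every sheet has been added
saveWorkbook(wb, "test.xlsx")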

Running this in 64-bit on a machine with 8GB RAM, it throws the GC overhead limit exceeded error while running addDataFrame for the first time.

How do I write large datasets to Excel spreadsheets using xlsx?

Recommended answer

This is a known issue: http://code.google.com/p/rexcel/issues/detail?id=33

While unresolved, the issue page links to a solution by Gabor Grothendieck suggesting that the heap size should be increased by setting the java.parameters option before the rJava package is loaded. (rJava is a dependency of xlsx.)

options(java.parameters = "-Xmx1000m")

The value 1000 is the number of megabytes of RAM to allow for the Java heap; it can be replaced with any value you like. My experiments with this suggest that bigger values are better, and you can happily use your full RAM entitlement. For example, I got the best results using:

options(java.parameters = "-Xmx8000m")

on the machine with 8GB RAM.
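
Because the option only takes effect if it is set before rJava is loaded, a small guard can catch the case where it is set too late. This is a sketch only: set_java_heap is a hypothetical helper, not a function from xlsx or rJava, and it assumes that checking loadedNamespaces() is a good enough test for whether the JVM may already have been started.

```r
# Hypothetical helper (not part of xlsx or rJava): set the JVM heap limit
# in megabytes and warn if rJava has already been loaded in this session.
set_java_heap <- function(megabytes) {
  if ("rJava" %in% loadedNamespaces()) {
    warning("rJava is already loaded; restart R and set the option first.")
  }
  # Build the flag with sprintf so large values are not printed as 1e+05
  options(java.parameters = sprintf("-Xmx%dm", as.integer(megabytes)))
  invisible(getOption("java.parameters"))
}

set_java_heap(1000)
print(getOption("java.parameters"))

# library(xlsx)  # load xlsx (and therefore rJava) only after this point
```

In a fresh session the call goes at the very top of the script, ahead of any library() call that might pull in rJava.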

A further improvement can be obtained by requesting a garbage collection in each iteration of the loop. As noted by @gjabel, R garbage collection can be performed using gc(). We can define a Java garbage collection function that calls the Java System.gc() method:

jgc <- function()
{
  # System.gc() is only a request; the JVM decides whether to collect
  .jcall("java/lang/System", method = "gc")
}

Then the loop can be updated to:

for(i in seq_along(the_data))
{
  # Collect garbage on both the R side and the Java side before each sheet
  gc()
  jgc()
  message("Creating sheet ", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame ", i)
  addDataFrame(the_data[[i]], sheet)
}

With both these code fixes, the code ran as far as i = 29 before throwing an error.
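
To record exactly which sheet fails rather than reading it off the messages, the loop can be wrapped in tryCatch. The sketch below is one possible way to do this, assuming the Java exception surfaces as an ordinary R error (as the error output in the question suggests). write_until_failure and its write_one argument are hypothetical names, and the demo uses a fake writer so that it runs without Java.

```r
# Hypothetical helper: call write_one(df, name) for each data frame and
# return the index of the first failure, or 0 if every sheet succeeded.
write_until_failure <- function(data_list, write_one) {
  for (i in seq_along(data_list)) {
    gc()  # R-side garbage collection before each sheet
    ok <- tryCatch(
      {
        write_one(data_list[[i]], names(data_list)[i])
        TRUE
      },
      error = function(e) {
        message("Failed on sheet ", i, ": ", conditionMessage(e))
        FALSE
      }
    )
    if (!ok) return(i)
  }
  0L
}

# Demo with a fake writer that fails on the third sheet.
demo <- list(a = data.frame(x = 1), b = data.frame(x = 2), c = data.frame(x = 3))
failed_at <- write_until_failure(
  demo,
  function(df, name) if (name == "c") stop("simulated heap error")
)
print(failed_at)  # 3
```

With xlsx loaded and a workbook wb created, the writer could be something like function(df, name) addDataFrame(df, createSheet(wb, sheetName = name)), optionally calling jgc() first.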

One technique that I tried unsuccessfully was to use write.xlsx2 to write the contents to file at each iteration. This was slower than the other code, and it fell over on the 10th iteration (but at least part of the contents were written to file).

for(i in seq_along(the_data))
{
  message("Writing sheet ", i)
  # Create the file on the first iteration, then append one sheet at a time
  write.xlsx2(
    the_data[[i]],
    "test.xlsx",
    sheetName = names(the_data)[i],
    append    = i > 1
  )
}
