How to load a large xlsx file with Apache POI?

Problem description

I have a large .xlsx file (141 MB, containing 293413 rows with 62 columns each) on which I need to perform some operations.

I am having problems with loading this file (OutOfMemoryError), as POI has a large memory footprint on XSSF (xlsx) workbooks.

This SO question is similar, and the solution presented is to increase the VM's allocated/maximum memory.

It seems to work for that kind of file size (9 MB), but for me it simply doesn't work, even if I allocate all available system memory. (Well, that's no surprise considering the file is over 15 times larger.)

I'd like to know whether there is any way to load the workbook so that it won't consume all the memory, yet without processing XSSF's underlying XML directly. (In other words, keeping a pure POI solution.)

If there isn't, though, you are welcome to say so ("There isn't.") and point me toward an "XML" solution.

Recommended answer

I was in a similar situation in a webserver environment. The typical size of the uploads was ~150k rows, and it wouldn't have been good to consume a ton of memory for a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to have to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader
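For intuition about why a streaming read keeps memory flat, here is a minimal sketch that uses only the JDK, not POI or excel-streaming-reader. It relies on the fact that an .xlsx file is a ZIP archive whose worksheets are XML parts, and it pulls one XML event at a time with StAX instead of building an object per cell. The class name SheetRowCounter is hypothetical, and the entry name xl/worksheets/sheet1.xml is only the conventional location of the first sheet, so treat it as an assumption about your workbook.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: counts rows and cells of one worksheet part without building a DOM. */
final class SheetRowCounter {

    /** Streams worksheet XML and returns {rowCount, cellCount}. */
    static long[] count(InputStream sheetXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Worksheet XML needs neither DTDs nor external entities, so switch both off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        long rows = 0;
        long cells = 0;
        try {
            XMLStreamReader xml = factory.createXMLStreamReader(sheetXml);
            try {
                // Only the current event is held in memory, regardless of sheet size.
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                        String name = xml.getLocalName();
                        if ("row".equals(name)) {
                            rows++;
                        } else if ("c".equals(name)) {
                            cells++;
                        }
                    }
                }
            } finally {
                xml.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed worksheet XML", e);
        }
        return new long[] {rows, cells};
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to an .xlsx file. The entry name is an assumption (see lead-in).
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("xl/worksheets/sheet1.xml");
            if (entry == null) {
                throw new IOException("xl/worksheets/sheet1.xml not found in " + args[0]);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                long[] counts = count(in);
                System.out.println(counts[0] + " rows, " + counts[1] + " cells");
            }
        }
    }
}
```

This is roughly the kind of "XML" approach the question asks about: it trades the convenient Row/Cell model for constant memory use, which is why switching to it means rewriting existing read logic.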

It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;

import com.monitorjbl.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .sheetIndex(0)        // index of sheet to use (defaults to 0)
        .read(is);            // InputStream or File for XLSX file (required)

// Rows are pulled from the stream as the loop advances, not loaded up front.
for (Row r : reader) {
  for (Cell c : r) {
    System.out.println(c.getStringCellValue());
  }
}

There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out from the cells, it works pretty well for that.
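One concrete case of data that lives outside the row stream is the shared-strings table: a string cell (t="s") stores only an index into the separate part xl/sharedStrings.xml. Whether this is the caveat meant above is an assumption on my part. The sketch below uses only the JDK, and the class SharedStringResolver is hypothetical. It also simplifies: it ignores inline strings, and it concatenates every <t> inside an <si> entry, including phonetic runs.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: resolves string cells against the shared-strings table. */
final class SharedStringResolver {

    private static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        return factory.createXMLStreamReader(in);
    }

    /** Reads every <si> entry of a sharedStrings part; list position equals string index. */
    static List<String> loadSharedStrings(InputStream sharedStringsXml) {
        List<String> strings = new ArrayList<>();
        try {
            XMLStreamReader xml = open(sharedStringsXml);
            StringBuilder current = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("si".equals(xml.getLocalName())) {
                        current = new StringBuilder();
                    } else if ("t".equals(xml.getLocalName()) && current != null) {
                        // Simplification: all text runs of the entry are concatenated.
                        current.append(xml.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "si".equals(xml.getLocalName())) {
                    strings.add(current.toString());
                    current = null;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed shared strings XML", e);
        }
        return strings;
    }

    /** Streams worksheet XML and returns the text of each cell that has a <v> value. */
    static List<String> readCellTexts(InputStream sheetXml, List<String> sharedStrings) {
        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader xml = open(sheetXml);
            String cellType = null;
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                if ("c".equals(xml.getLocalName())) {
                    cellType = xml.getAttributeValue(null, "t"); // null for plain numbers
                } else if ("v".equals(xml.getLocalName())) {
                    String raw = xml.getElementText();
                    // For t="s" the value is an index, not the text itself.
                    texts.add("s".equals(cellType)
                            ? sharedStrings.get(Integer.parseInt(raw))
                            : raw);
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed worksheet XML", e);
        }
        return texts;
    }
}
```

Note that this sketch holds the whole shared-strings table in memory, which can itself be large for text-heavy workbooks.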
