SXSSFWorkbook.write to FileOutputStream 写入大文件 [英] SXSSFWorkbook.write to FileOutputStream writes huge files

查看:222
本文介绍了SXSSFWorkbook.write to FileOutputStream 写入大文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 SXSSFWorkbook 从头开始​​编写 Excel 电子表格.

I'm trying to use SXSSFWorkbook to write an Excel spreadsheet from scratch.

      wb = SXSSFWorkbook(500)
      wb.isCompressTempFiles = true
      sh = streamingWorkbook.createSheet(t.getMessage("template.sheet.name"))

一切都很好,但是当我调用最终代码时:

All is fine but when I invoke the final code:

    val out = FileOutputStream(localPath)
    wb.write(out)
    out.close()
    // dispose of temporary files backing this workbook on disk
    wb.dispose()

我得到了一个巨大的 excel 文件,而不是我期待的压缩 XLSX.我尝试手动压缩文件,从 120MB 的文件中我可以将其压缩到 9MB.那么我错过了什么?

I get a huge excel file instead of the zipped XLSX I'm expecting. I've tried to manually zip the file and from a 120MB file I can get it down to 9MB. So what am I missing?

使用:Kotlin 和

Using: Kotlin and

    implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.2'  // For `.xlsx` files

-- 更新 1

我的印象是 xlsx 本质上是包含 xml 数据的压缩文件 [1].通过 XSSFWorkbook 和 SXSSFWorkbook 输出的 POI 至少可以压缩 10 个数量级.我用这个简单的代码来演示:

I was under the impression that xlsx are zipped files containing xml data in essence [1]. What POI outputs through both XSSFWorkbook and SXSSFWorkbook can be compressed at least by an order of magnitude of 10. I've used this simple code to demonstrate:

fun main() {
  val workbook = XSSFWorkbook()
  writeRowsAndSave(workbook, "test.xlsx")
  workbook.close()

  val streamingWorkbook = SXSSFWorkbook(IN_MEMORY_ROWS_WINDOW_SIZE)
  streamingWorkbook.isCompressTempFiles = true
  writeRowsAndSave(streamingWorkbook, "test-streaming.xlsx")
  streamingWorkbook.dispose()
}

private fun writeRowsAndSave(workbook: Workbook, fileName: String) {
  val ROWS_COUNT = 2_000
  val COLS_COUNT = 1_000

  val sheet = workbook.createSheet("Test Sheet 1")
  for (i in 1..ROWS_COUNT) {
    val row = sheet.createRow(i)
    println("Row $i")
    for(j in 1..COLS_COUNT) {
        row.createCell(j).setCellValue("Test $i")
    }
  }

  FileOutputStream("./$fileName").use {
      workbook.write(it)
  }
}

这会产生 5MB-每个文件,压缩后大约有 439KB (?!).

This produces 5MB-each files that when compressed have roughly 439KB (?!).

推荐答案

SXSSFWorkbook 默认使用内联字符串而不是共享字符串表.这意味着 SXSSFWorkbook 直接在工作表中写入文本,即使它是同一文本的多次.XSSFWorkbook 和 Excel 的 GUI 都使用共享字符串表,其中文本获取索引,相同的文本仅存储一次,然后在工作表中使用索引.但这应该不会对生成的 *.xlsx 的文件大小产生太大影响.

SXSSFWorkbook defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx.

SXSSFWorkbook,以及所有其他 Office Open XML 格式的文件,apache poi 创建,使用 org.apache 压缩.commons.compress.archivers.zip.ZipArchiveOutputStream.它使用 deflate 作为压缩算法和 Deflater.DEFAULT_COMPRESSION 作为默认压缩级别.可以覆盖 SXSSFWorkbookprotected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) 以设置另一个压缩级别.但这也不会对结果 *.xlsx 的文件大小产生太大影响.

SXSSFWorkbook, as well as all other Office Open XML formatted files, apache poi creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) of SXSSFWorkbook to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx.

示例 Java 代码:

import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;

class CreateSXSSFDifferentCompression {

 static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize, 
                                          boolean compressTmpFiles, boolean useSharedStringsTable) {
  SXSSFWorkbook workbook = null;
  if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
    protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
     ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
     zos.setUseZip64(Zip64Mode.AsNeeded);  
     zos.setLevel(compressionLevel);
     return zos;
    }    
   }; 
  } else {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
  }
  return workbook;
 }

 public static void main(String[] args) throws Exception {

  SXSSFWorkbook workbook = null;

  // uses Deflater.DEFAULT_COMPRESSION and inline strings
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false); 

  // uses Deflater.DEFAULT_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true); 

  // uses Deflater.BEST_COMPRESSION and inline strings
  workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false); 

  // uses Deflater.BEST_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true); 

  int ROWS_COUNT = 2000;
  int COLS_COUNT = 1000;

  Sheet sheet = workbook.createSheet("Test Sheet 1");
  for (int i = 1 ; i <= ROWS_COUNT; i++) {
   Row row = sheet.createRow(i);
   //System.out.println("Row " + i);
   for(int j = 1; j <= COLS_COUNT; j++) {
    row.createCell(j).setCellValue("Test " + i);
   }
  }

  FileOutputStream out = new FileOutputStream("./Excel.xlsx");
  workbook.write(out);
  out.close();
  workbook.close();
  workbook.dispose();

  File file = new File("./Excel.xlsx");
  System.out.println(file.length());

 }
}

这导致 Excel.xlsx 文件大小为:

This results in Excel.xlsx file size of:

5,031,034 字节当使用 Deflater.DEFAULT_COMPRESSION 和内联字符串时.

5,031,034 bytes when Deflater.DEFAULT_COMPRESSION and inline strings are used.

4,972,663 字节当使用 Deflater.DEFAULT_COMPRESSION 和共享字符串表时.

4,972,663 bytes when Deflater.DEFAULT_COMPRESSION and shared strings table are used.

4,972,915 字节当使用 Deflater.BEST_COMPRESSION 和内联字符串时.

4,972,915 bytes when Deflater.BEST_COMPRESSION and inline strings are used.

还有4,966,749 字节当使用 Deflater.BEST_COMPRESSION 和共享字符串表时.

And 4,966,749 bytes when Deflater.BEST_COMPRESSION and shared strings table are used.

使用:Java 12apache poi 4.1.2Ubuntu Linux.

对于 2,000 行 x 1,000 列的电子表格,我既不会称其为巨大,也不会称不同设置的影响很大.

Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.

而且条目压缩得很好.

如果您查看 Excel.xlsx ZIP 存档,您会发现使用内联字符串时 xl/worksheets/sheet1.xml 的未压缩大小为 112,380,273 字节.xl/sharedStrings.xml 的未压缩大小为 138 字节,仅包含非常基本的 XML.

If you look into the Excel.xlsx ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml 112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml is 138 bytes then and only contains the very basic XML.

如果使用共享字符串表,则xl/worksheets/sheet1.xml未压缩大小为68,377,273字节,xl/sharedStrings.xml未压缩大小为49,045 字节,包含 2,000 个条目.

If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml is 49,045 bytes and contains 2,000 entries.

如果 Excel 本身保存 *.xlsx 文件,它会在内容相同时创建文件大小大致相同的文件.所以 Excel 本身使用相同的压缩级别.

If Excel itself saves *.xlsx files, it creates files having approximately the same file size when content is equal. So Excel itself uses the same compression level.

当然,当再次将 Excel.xlsx 存储到 *.zip 存档时,可以更多地压缩 *.xlsx 文件.但这不会是 Excel 期望的 *.xlsx 文件.

Of course one can compressing the *.xlsx files more when storing the Excel.xlsx into a *.zip archive again. But that would not be what Excel expects to be a *.xlsx file.

Microsoft 声明于 开放 XML 格式有什么好处?:

压缩文件 文件会自动压缩,在某些情况下最多可缩小 75%.Open XML 格式使用 zip压缩技术来存储文件,提供潜在的成本节省,因为它减少了存储文件所需的磁盘空间和减少通过电子邮件发送文件所需的带宽,超过网络,并通过 Internet.当你打开一个文件时,它是自动解压.当您保存文件时,它会自动再次拉上拉链.您不必安装任何特殊的 zip 实用程序来在 Office 中打开和关闭文件.

Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip compression technology to store documents, offering potential cost savings as it reduces the disk space required to store files and decreases the bandwidth needed to send files via e-mail, over networks, and across the Internet. When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again. You do not have to install any special zip utilities to open and close files in Office.

这里的重要部分是:

当您打开文件时,它会自动解压缩.当你保存一个文件,它会自动再次压缩.

When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again.

这意味着,如果 apache poi 将使用其他方法或压缩级别而不是 Microsoft Office 本身来压缩文件,那么 Microsoft Office 将是无法使用 apache poi 创建的文件.

That means, if apache poi would compress the files using other methods or compression levels than Microsoft Office itself, then Microsoft Office would be unable to do so with files apache poi had created.

因此,由于 apache poi 创建的文件是 Excel (Microsoft Office) 可以直接打开的,所以它使用相同的压缩方法和压缩级别与 Excel (Microsoft Office) 相同.

So, since apache poi creates files which Excel (Microsoft Office) is able opening directly, it uses the same compression method and compression level as Excel (Microsoft Office) will do.

这篇关于SXSSFWorkbook.write to FileOutputStream 写入大文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆