Apache PDFBOX - getting java.lang.OutOfMemoryError when using split(PDDocument document)


Problem Description


I am trying to split a 300-page document using the Apache PDFBox API v2.0.2. When I try to split the PDF file into single pages with the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I receive the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

This indicates that the GC is spending an excessive amount of time collecting while reclaiming too little of the heap to justify the effort.

There are numerous JVM tuning options that can work around the situation; however, all of these just treat the symptom rather than the real issue.
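For illustration, two commonly suggested knobs are a larger heap and disabling the GC-overhead check; the heap size, jar, and class names below are placeholders, and neither flag frees the memory the split parts occupy:

    # illustrative only: -Xmx delays the error, -XX:-UseGCOverheadLimit merely disables the check
    java -Xmx2048m -XX:-UseGCOverheadLimit -cp myapp.jar com.example.SplitterMain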

One final note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case. Thanks.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:

 1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270-page, 13.8 MB PDF file, and after
    slicing, each slice averages 80 KB, for a total size of 30.7 MB.
 2. The split throws the exception before it even returns the split parts.

I found that the split can succeed as long as I do not pass the whole document at once; if I instead pass it in "batches" of 20-30 pages each, it does the job.

Solution

PDFBox stores the parts resulting from the split operation in the heap as objects of type PDDocument. The heap therefore fills up quickly, and even if you call close() after every round of the loop, the GC cannot reclaim memory as fast as it is being filled.

One option is to break the split operation itself into batches, where each batch is a relatively manageable chunk (10 to 40 pages):

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int noOfBatches = totalPages / batchSize;
        int finalBatchSize = totalPages % batchSize;
        int start;
        int end = 0;
        for (int i = 1; i <= noOfBatches; i++) {
            // Splitter page ranges are 1-based and inclusive on both ends
            start = end + 1;
            end = start + batchSize - 1;
            System.out.println("Batch: " + i + " start: " + start + " end: " + end);
            split(document, start, end);
        }
        // handle the remaining pages, if any
        if (finalBatchSize > 0) {
            start = end + 1;
            end = start + finalBatchSize - 1;
            System.out.println("Final batch start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document once all batches are done
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    String outputPath = Config.INSTANCE.getProperty("outputPath");

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // include the batch start page in the name so files stay unique across batches
        String pdfFullPath = outputPath + File.separator
                + document.getDocumentInformation().getTitle() + start + "_" + index + ".pdf";
        PDDocument splittedDocument = splittedDocuments.get(index);
        splittedDocument.save(pdfFullPath);
        // closing each part right after saving it is what keeps the heap from filling up
        splittedDocument.close();
    }
}
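Independently of batching, PDFBox 2.0 can buffer the parsed document in a temp file instead of the heap via MemoryUsageSetting. The sketch below is an illustration rather than part of the answer above (the path is a placeholder); it can reduce heap pressure while loading, though the split parts themselves still live on the heap until closed:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

public class TempFileBackedLoad {
    public static void main(String[] args) throws IOException {
        File inputFile = new File("path/to/the/file.pdf"); // placeholder path
        // back the document's parsed structures with a temp file rather than main memory
        PDDocument document = PDDocument.load(inputFile, MemoryUsageSetting.setupTempFileOnly());
        try {
            System.out.println("Pages: " + document.getNumberOfPages());
        } finally {
            document.close();
        }
    }
}

MemoryUsageSetting.setupMixed(maxMainMemoryBytes) is a middle ground if pure temp-file I/O turns out to be too slow.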
