Split and merge pdf files using PDFBOX produces large file

Question

I have a large print file in PDF that contains 5544 pages and is about 36mb in size. The file was created by MS Word 2010 and contains only text and a logo on each letter/document.

I split it into 5544 files and merge back into 2770 letters, based on keywords. Each letter is approx. 140-145kb.

When I merge all the letters into a new pdf print file, still containing 5544 pages, the size of the file has grown to 396mb.

All text extraction, splitting and merging is performed with calls to the Apache PDFBox command-line tools from PHP, but the result is the same when run from a console.

Any idea how to reduce the file size of the letters and the final print file? It seems like PDFBox has just appended each letter to the final print file, instead of creating a new pdf document.

It's only in the testing phase that all the documents are merged into the final print file; some of the documents will be sent by email.

I have also tried SAMBox (a fork of PDFBox) but with nearly the same result:

pdfinfo Original.pdf
Title:          Printfile
Author:         Claus Hjort Bube
Creator:        Microsoft® Word 2010
Producer:       Microsoft® Word 2010
CreationDate:   Fri May 19 12:16:34 2017 CEST
ModDate:        Fri May 19 12:16:34 2017 CEST
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      36092281 bytes
Optimized:      no
PDF version:    1.5

pdfinfo PDFBox.pdf
Title:          Printfile
Author:         Claus Hjort Bube
Creator:        Microsoft® Word 2010
Producer:       Microsoft® Word 2010
CreationDate:   Fri May 19 12:16:34 2017 CEST
ModDate:        Fri May 19 12:16:34 2017 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      396622354 bytes
Optimized:      no
PDF version:    1.4

pdfinfo SAMBox.pdf
Creator:        Sejda Console 3.2.17
Producer:       SAMBox 1.1.8 (www.sejda.org)
ModDate:        Tue Jul 11 23:34:33 2017 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      378779436 bytes
Optimized:      no
PDF version:    1.7

Recommended answer

That may sound sad but it is correct. When splitting, each file gets the resources (e.g. fonts and company logo graphic) it needs. When merged back, PDFBox does not know that these may be the same over the whole document, so these are now duplicated a lot.
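
The sizes reported in the question appear consistent with this explanation. As a rough check, using only the pdfinfo figures and the letter count from the question (rounded values, not a measurement of what is actually stored inside the files):

```latex
\begin{align*}
\text{original, per page:} \quad
  & \frac{36\,092\,281\ \text{bytes}}{5544\ \text{pages}} \approx 6\,510\ \text{bytes} \\[4pt]
\text{merged, per page:} \quad
  & \frac{396\,622\,354\ \text{bytes}}{5544\ \text{pages}} \approx 71\,541\ \text{bytes} \\[4pt]
\text{merged, per letter:} \quad
  & \frac{396\,622\,354\ \text{bytes}}{2770\ \text{letters}} \approx 143\,185\ \text{bytes} \\[4pt]
\text{growth factor:} \quad
  & \frac{396\,622\,354}{36\,092\,281} \approx 11
\end{align*}
```

The merged file works out to roughly 143 kB per letter, which falls inside the 140-145kb range reported for the individual letters. In other words, the merged file is about as large as the sum of its parts, which is what one would expect if every letter carries its own copy of the fonts and logo and none of them are shared after merging.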

The only solution I see for you would be to use the PDFBox Java API to create the mailing files and the final print file in one step, i.e. without creating single files that are then merged back.
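
A minimal sketch of that one-step approach, assuming PDFBox 2.0.x (in 3.x, `PDDocument.load` is replaced by `Loader.loadPDF`). The class name, the output file names, the keyword `"Dear"` and the rule "a page containing the keyword starts a new letter" are all placeholders invented for illustration; the question does not say which keywords are used. The idea is that every output document imports its pages directly from the one loaded source, so fonts and images should be referenced rather than cloned per page:

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SinglePassSplit {

    /** Copies source pages [from, to) (0-based, end exclusive) into a new document. */
    static PDDocument copyPages(PDDocument source, int from, int to) throws IOException {
        PDDocument target = new PDDocument();
        for (int i = from; i < to; i++) {
            // importPage shallow-copies the page dictionary, so the page keeps
            // pointing at the source's resource objects (fonts, logo image).
            target.importPage(source.getPage(i));
        }
        return target;
    }

    /** Hypothetical rule: a page starts a new letter if its text contains the keyword. */
    static boolean startsLetter(PDDocument source, int pageIndex, String keyword)
            throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageIndex + 1); // PDFTextStripper page numbers are 1-based
        stripper.setEndPage(pageIndex + 1);
        return stripper.getText(source).contains(keyword);
    }

    public static void main(String[] args) throws IOException {
        String keyword = "Dear"; // placeholder, not taken from the question
        // The source must stay open until every target has been saved.
        try (PDDocument source = PDDocument.load(new File("Original.pdf"));
             PDDocument printFile = new PDDocument()) {
            int pageCount = source.getNumberOfPages();
            int letterNo = 0;
            int start = 0;
            for (int i = 1; i <= pageCount; i++) {
                // Close the current letter at the end of the file or when the
                // next page starts a new letter.
                if (i == pageCount || startsLetter(source, i, keyword)) {
                    try (PDDocument letter = copyPages(source, start, i)) {
                        letter.save("letter-" + (++letterNo) + ".pdf");
                    }
                    start = i;
                }
            }
            // Build the print file from the same source pages, not from the letters.
            for (int i = 0; i < pageCount; i++) {
                printFile.importPage(source.getPage(i));
            }
            printFile.save("Printfile.pdf");
        }
    }
}
```

Note that `importPage` does not carry over resources that a page inherits from a parent node of the page tree, so the result is worth checking with `pdfinfo` on a small sample before running it on the full print file.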
