Optimize PDF files (with Ghostscript or other)

Question

Is Ghostscript the best option if you want to optimize a PDF file and reduce its size?

I need to store a lot of PDF files, so I need to optimize them and reduce the file size as much as possible.

Does anyone have experience with Ghostscript and/or other tools?

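// Build the command as one line and quote both paths so spaces in file names are safe.
exec('gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 '
    . '-dPDFSETTINGS=/screen -sOutputFile=' . escapeshellarg($file_new)
    . ' ' . escapeshellarg($file));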

Recommended answer

If you are looking for Free (as in 'libre') Software, Ghostscript is surely your best choice. However, it is not always easy to use -- documentation for some of its (very powerful) processing options is hard to find.

Have a look at this answer, which explains how to exercise more detailed control over image resolution downsampling than the generic -dPDFSETTINGS=/screen provides (that setting defines a few overall defaults, which you may want to override):

Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72dpi (this value is what -dPDFSETTINGS=/screen uses -- you may want to go even lower):

-dDownsampleColorImages=true 
-dDownsampleGrayImages=true 
-dDownsampleMonoImages=true 
-dColorImageResolution=72 
-dGrayImageResolution=72 
-dMonoImageResolution=72 
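
Since the six options above only differ in the image kind and the target resolution, they can be generated rather than typed out. The following is a minimal sketch under stated assumptions: downsample_opts is a hypothetical helper name (not part of Ghostscript), and input.pdf / output.pdf are placeholder file names. The gs call only runs when both gs and input.pdf are present.

```shell
#!/bin/sh
# downsample_opts is a hypothetical helper, not part of Ghostscript.
# It prints the downsampling options for a target resolution (default: 72 dpi),
# one option per line.
downsample_opts() {
  dpi="${1:-72}"
  for kind in Color Gray Mono; do
    printf '%s\n' "-dDownsample${kind}Images=true" "-d${kind}ImageResolution=${dpi}"
  done
}

# Example: downsample every image to 50 dpi.
# input.pdf and output.pdf are placeholder names (assumptions).
if command -v gs >/dev/null 2>&1 && [ -f input.pdf ]; then
  # The substitution is left unquoted on purpose so each option becomes one word.
  gs -o output.pdf -sDEVICE=pdfwrite $(downsample_opts 50) input.pdf
fi
```

Calling downsample_opts without an argument prints the same six options as listed above, with 72 as the resolution.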

If you want to find out whether Ghostscript can also 'un-embed' the fonts used (sometimes it works, sometimes not -- depending on the complexity of the embedded font, and also on the font type used), you can try adding the following to your gs command:

gs \
  -o output.pdf \
  [...other options...] \
  -dEmbedAllFonts=false \
  -dSubsetFonts=true \
  -dConvertCMYKImagesToRGB=true \
  -dCompressFonts=true \
  -c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
  -c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
  -f input.pdf

Note: Be aware that downsampling image resolution will reduce quality (irreversibly), and un-embedding fonts will make it difficult or impossible to display and print the PDFs unless the same fonts are installed on the machine.

One option which I had overlooked in my original answer is to add

-dDetectDuplicateImages=true

to the command line. This parameter makes Ghostscript try to detect any images which are embedded in the PDF multiple times. This can happen if you use an image as a logo or page background, and if the PDF-generating software is not optimized for this situation. This used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does such stupid things).

It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:

pdfinfo p1.pdf

 Producer:       libtiff / tiff2pdf - 20120922
 CreationDate:   Tue Jan  6 19:36:34 2015
 ModDate:        Tue Jan  6 19:36:34 2015
 Tagged:         no
 UserProperties: no
 Suspects:       no
 Form:           none
 JavaScript:     no
 Pages:          1
 Encrypted:      no
 Page size:      595 x 842 pts (A4)
 Page rot:       0
 File size:      20983 bytes
 Optimized:      no
 PDF version:    1.1

Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which lists all images included in a PDF file:

pdfimages -list p1.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image    423   600   rgb    3   8 jpeg     no     7  0    52    52 19.2K 2.6%

This sample PDF is a 1-page document containing one image, which is JPEG-compressed, has a width of 423 pixels and a height of 600 pixels, and renders at a resolution of 52 PPI on the page.

If we concatenate 3 copies of this file with the help of pdftk like so:

pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf

then the result shows these image properties via pdfimages -list:

pdfimages -list p3.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no     4  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no     8  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    12  0    52    52 19.2K 2.6%

This shows that there are now 3 identical PDF objects (with the IDs 4, 8 and 12) embedded in p3.pdf. p3.pdf consists of 3 pages:

pdfinfo p3.pdf | grep Pages:

 Pages:          3

Optimize PDF by replacing duplicate images with references

Now we can apply the above-mentioned optimization with the help of Ghostscript:

 gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf

Check:

 pdfimages -list p3-optim.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%

There is still one image listed per page -- but the PDF object ID is now always the same: 10.
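
Comparing object IDs by eye gets tedious for longer listings. As a rough sketch, the check can be scripted by counting how many distinct object IDs appear in the pdfimages -list output. count_image_objects is a hypothetical helper name, and the column positions (object number and generation in columns 11 and 12) are an assumption based on the listing format shown above; other Poppler versions may lay out the columns differently. Note that it only counts objects, so it cannot tell whether images stored as separate objects are actually identical.

```shell
#!/bin/sh
# count_image_objects is a hypothetical helper. It reads `pdfimages -list`
# output on stdin and compares the number of listed images with the number
# of distinct PDF objects behind them.
# Assumption: columns 11 and 12 hold the object number and generation,
# as in the listings shown above.
count_image_objects() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 12 {
         total++
         key = $11 " " $12
         if (!(key in seen)) { seen[key] = 1; distinct++ }
       }
       END { printf "%d images, %d distinct objects\n", total, distinct }'
}

# Example: only runs if Poppler's pdfimages and the sample file are available.
if command -v pdfimages >/dev/null 2>&1 && [ -f p3-optim.pdf ]; then
  pdfimages -list p3-optim.pdf | count_image_objects
fi
```

Fed with the two listings above, this reports "3 images, 3 distinct objects" for p3.pdf and "3 images, 1 distinct objects" for p3-optim.pdf.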

 ls -ltrh p1.pdf p3.pdf p3-optim.pdf

   -rw-r--r--@ 1 kp  staff    20K Jan  6 19:36 p1.pdf
   -rw-r--r--  1 kp  staff    60K Jan  6 19:37 p3.pdf
   -rw-r--r--  1 kp  staff    16K Jan  6 19:40 p3-optim.pdf

As you can see, the "dumb" concatenation made with pdftk increased the file size to three times that of the original. The optimization by Ghostscript brought it down by a considerable amount: from 60K to 16K.
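
When many files have to be processed, it can help to express the saving as a number instead of reading it off ls. The sketch below is one way to do that; percent_saved is a hypothetical helper name, and the result is integer arithmetic (truncated), so treat it as a rough indicator only.

```shell
#!/bin/sh
# percent_saved is a hypothetical helper. It prints by how many percent
# (integer, truncated) the second file is smaller than the first.
percent_saved() {
  orig=$(wc -c < "$1" | tr -d ' ')
  new=$(wc -c < "$2" | tr -d ' ')
  if [ "$orig" -eq 0 ]; then
    echo 0   # avoid dividing by zero for an empty original
    return
  fi
  echo $(( (orig - new) * 100 / orig ))
}

# Example: only runs if both sample files from above exist.
if [ -f p3.pdf ] && [ -f p3-optim.pdf ]; then
  echo "saved: $(percent_saved p3.pdf p3-optim.pdf)%"
fi
```

A negative result means the "optimized" file came out larger than the original, which is worth checking for before replacing any stored file.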

The most recent versions of Ghostscript may even apply -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, did not use it by default.)
