Optimize PDF files (with Ghostscript or other)

Problem description

Is Ghostscript the best option if you want to optimize a PDF file and reduce the file size?

I need to store a lot of PDF files, so I need to optimize them and reduce the file size as much as possible.

Does anyone have any experience with Ghostscript and/or other tools?

command line

exec('gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 '
    . '-dPDFSETTINGS=/screen -sOutputFile=' . escapeshellarg($file_new) . ' ' . escapeshellarg($file));

Solution

If you are looking for Free (as in 'libre') Software, Ghostscript is surely your best choice. However, it is not always easy to use: some of its (very powerful) processing options are poorly documented.

Have a look at this answer, which explains how to exercise finer control over image downsampling than the generic -dPDFSETTINGS=/screen provides (that switch sets a few overall defaults, which you may want to override).

Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72 dpi (this is the value that -dPDFSETTINGS=/screen uses; you may want to go even lower):

-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
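
If you want to experiment with resolutions other than 72, the six switches above can be generated from a single number. This is only a minimal sketch: `gs_downsample_flags` is a hypothetical helper name (not part of Ghostscript), and it assumes you want the same target resolution for color, gray and monochrome images.

```shell
# Print the six downsampling switches for one target resolution (default: 72 dpi).
# gs_downsample_flags is a made-up helper name, not a Ghostscript feature.
gs_downsample_flags() {
  dpi="${1:-72}"
  for kind in Color Gray Mono; do
    printf '%s\n' "-dDownsample${kind}Images=true" "-d${kind}ImageResolution=${dpi}"
  done
}

# Usage (not executed here). The printed switches contain no spaces,
# so leaving the command substitution unquoted is safe:
#   gs -o output.pdf -sDEVICE=pdfwrite $(gs_downsample_flags 50) input.pdf
```

Keeping the resolution in one place makes it easier to try a few values and compare the resulting file sizes and image quality.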

If you want to see whether Ghostscript can also 'un-embed' the fonts used (sometimes it works, sometimes not, depending on the complexity of the embedded font and on the font type), try adding the following to your gs command:

gs \
  -o output.pdf \
  [...other options...] \
  -dEmbedAllFonts=false \
  -dSubsetFonts=true \
  -dConvertCMYKImagesToRGB=true \
  -dCompressFonts=true \
  -c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
  -c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
  -f input.pdf

Note: Be aware that downsampling image resolution will surely reduce quality (irreversibly), and un-embedding fonts will make it difficult or impossible to display and print the PDFs correctly unless the same fonts are installed on the machine.
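
Since un-embedding only works some of the time, it helps to check the result. One way, sketched here under assumptions, is to count the embedded fonts before and after. This relies on Poppler's pdffonts utility, which the answer above does not mention, and on its listing ending in the columns emb, sub, uni and a two-part object ID; `count_embedded_fonts` is a hypothetical helper name.

```shell
# Read `pdffonts` output on stdin and print how many fonts are marked as embedded.
# Assumption: each font row ends with "emb sub uni <object> <generation>",
# so the "emb" value is the fifth field counted from the end of the line.
count_embedded_fonts() {
  awk 'NR > 2 && NF >= 6 && $(NF-4) == "yes" { n++ } END { print n + 0 }'
}

# Usage (not executed here):
#   pdffonts input.pdf  | count_embedded_fonts
#   pdffonts output.pdf | count_embedded_fonts
```

If the second number is not lower than the first, Ghostscript did not manage to un-embed anything for that file.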


Update

One option which I had overlooked in my original answer is to add

-dDetectDuplicateImages=true

to the command line. This parameter makes Ghostscript try to detect any images which are embedded in the PDF multiple times. This can happen if you use an image as a logo or page background and the PDF-generating software is not optimized for this situation. This used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does this).

It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:

pdfinfo p1.pdf

 Producer:       libtiff / tiff2pdf - 20120922
 CreationDate:   Tue Jan  6 19:36:34 2015
 ModDate:        Tue Jan  6 19:36:34 2015
 Tagged:         no
 UserProperties: no
 Suspects:       no
 Form:           none
 JavaScript:     no
 Pages:          1
 Encrypted:      no
 Page size:      595 x 842 pts (A4)
 Page rot:       0
 File size:      20983 bytes
 Optimized:      no
 PDF version:    1.1

Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which can list all images included in a PDF file:

pdfimages -list p1.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image    423   600   rgb    3   8 jpeg     no     7  0    52    52 19.2K 2.6%

This sample PDF is a 1-page document containing a single JPEG-compressed image that is 423 pixels wide and 600 pixels high and renders at a resolution of 52 PPI on the page.

If we concatenate 3 copies of this file with the help of pdftk like so:

pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf

then the result shows these image properties via pdfimages -list:

pdfimages -list p3.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no     4  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no     8  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    12  0    52    52 19.2K 2.6%

This shows that there are 3 identical PDF objects (with the IDs 4, 8 and 12) which are embedded in p3.pdf now. p3.pdf consists of 3 pages:

pdfinfo p3.pdf | grep Pages:

 Pages:          3
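
The check done by eye above (same image properties, different object IDs) can be roughly automated. The following is only a heuristic sketch: `find_duplicate_candidates` is a hypothetical helper name, and it assumes that images with the same dimensions, color space, encoding and size are copies of each other, which is likely but not guaranteed.

```shell
# Read `pdfimages -list` output on stdin. Print one line for every group of
# images that share width, height, color, encoding and size but are stored
# under more than one object ID (i.e. likely duplicates).
find_duplicate_candidates() {
  awk '
    $1 !~ /^[0-9]+$/ || NF < 16 { next }     # skip the header and separator lines
    {
      key = $4 "x" $5 " " $6 " " $9 " " $(NF-1)
      obj = $(NF-5) " " $(NF-4)              # object number and generation
      if (!((key, obj) in seen)) { seen[key, obj] = 1; count[key]++ }
    }
    END {
      for (k in count) if (count[k] > 1) print count[k] " objects: " k
    }
  '
}

# Usage (not executed here):
#   pdfimages -list p3.pdf | find_duplicate_candidates
```

For the p3.pdf listing above this reports one group of 3 objects; a file where every page references the same image object produces no output.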

Optimize PDF by replacing duplicate images with references

Now we can apply the above-mentioned optimization with the help of Ghostscript:

 gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf

Checking:

 pdfimages -list p3-optim.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%

There is still one image listed per page -- but the PDF object ID is always the same now: 10.

 ls -ltrh p1.pdf p3.pdf p3-optim.pdf

   -rw-r--r--@ 1 kp  staff    20K Jan  6 19:36 p1.pdf
   -rw-r--r--  1 kp  staff    60K Jan  6 19:37 p3.pdf
   -rw-r--r--  1 kp  staff    16K Jan  6 19:40 p3-optim.pdf

As you can see, the "dumb" concatenation made with pdftk tripled the file size, from 20K to 60K. The optimization by Ghostscript brought it down to 16K, which is even smaller than the single-page original.
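
When you process a large number of files, not every one of them will shrink, so it can make sense to keep the Ghostscript output only when it is actually smaller. Here is a minimal sketch of that idea; `keep_smaller` is a hypothetical helper name, and the policy of deleting the larger output is an assumption, not something Ghostscript does for you.

```shell
# Replace ORIGINAL with OPTIMIZED only if OPTIMIZED is non-empty and smaller.
# Prints "replaced" or "kept" so that a batch loop can log what happened.
keep_smaller() {
  original="$1"; optimized="$2"
  [ -f "$original" ] && [ -f "$optimized" ] || return 1
  old_size=$(($(wc -c < "$original")))       # byte counts, whitespace stripped
  new_size=$(($(wc -c < "$optimized")))
  if [ "$new_size" -gt 0 ] && [ "$new_size" -lt "$old_size" ]; then
    mv -- "$optimized" "$original" && echo replaced
  else
    rm -f -- "$optimized" && echo kept
  fi
}

# Usage (not executed here):
#   gs -o tmp-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true in.pdf \
#     && keep_smaller in.pdf tmp-optim.pdf
```

Because the original is overwritten, try this on copies first and check that the optimized files still display and print as expected.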

The most recent versions of Ghostscript may even apply -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, didn't use it by default.)
