Why does converting this PDF file fail when using ImageMagick/Ghostscript?


Problem description

I want to convert a PDF file compiled with LaTeX (using the XeLaTeX engine so that I can use an Arabic font), upload it to the web, and prevent copying and pasting of its content.

Since I am looking for free software to do this, I came across two powerful beasts for the job: ImageMagick and Ghostscript. All I need is to convert a text PDF to an image PDF in one go, preferably with batch processing (converting many PDFs at once).

I run this command on the command line, and it works fine for PDFs written in English:

convert someenglish.pdf output.pdf  

Now when I do the same for an Arabic PDF, I get this error:

convert.exe: PDFDelegateFailed `[ghostscript library] -q -dQUIET -dSAFER -dBATCH
 -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sD
EVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72"  "-sOutputFile
=C:/Users/doctorate/AppData/Local/Temp/magick-65203BNMxTDhXtkF%d" "-fC:/Users/doctorate/Ap
pData/Local/Temp/magick-65206AK54hOoKA62" "-fC:/Users/doctorate/AppData/Local/Temp/ma
gick-6520hDn-KMyTyxy2"':    **** Error reading a content stream. The page may be
 incomplete.
   **** Incorrect object count in object stream.
Error: /rangecheck in resolveobjectstream
Operand stack:
   78424   10   1   10   --dict:7/15(L)--   26   --nostringval--   35   --nostri
ngval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--
  --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict
:4/4(L)--   --dict:3/3(L)--   --dict:2/2(L)--   --nostringval--   --dict:7/7(L)-
-   --dict:10/10(L)--   --nostringval--   --nostringval--   Type   Font   Subtyp
e   CIDFontType2   BaseFont   MYCROL+(AH
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval-
-   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   fa
lse   1   %stopped_push   1983   1   3   %oparray_pop   1982   1   3   %oparray_
pop   1966   1   3   %oparray_pop   --nostringval--   --nostringval--   --nostri
ngval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--
  --nostringval--   --nostringval--
Dictionary stack:
   --dict:1193/1684(ro)(G)--   --dict:1/20(G)--   --dict:82/200(L)--   --dict:82
/200(L)--   --dict:116/127(ro)(G)--   --dict:280/300(ro)(G)--   --dict:24/32(L)-
-
Current allocation mode is local
GPL Ghostscript 9.15: Unrecoverable error, exit code 1
 @ error/pdf.c/InvokePDFDelegate/263.
convert.exe: no images defined `test.pdf' @ error/convert.c/ConvertImageCommand/
3210.

Question

What am I missing here? I am not a programmer, so please take that into account in your answer. I would be very grateful if you could show how to do this as a batch process.

Notes

  • Windows 7 32-bit

  • Ghostscript version 9.15

  • Image quality is not an issue for me; even 72 dpi will be fine.

  • I want to strike a balance between output size and text clarity. I just want the text to be readable on the web, not to run OCR on it, so the image doesn't need to be very sharp. Output size matters more, and the smaller the better. Honestly, I have no idea which would work better in this case: converting the PDF pages to PNG or to JPEG.

  • I don't want to burst a PDF into multiple serially named PNGs or JPEGs. I simply want one PDF converted to another PDF whose pages are images, with no more copy-and-paste-prone text.
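For PDFs that do convert cleanly, the batch step asked about above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes ImageMagick's `convert` is the first `convert` on the PATH (on Windows, a system tool of the same name can shadow it), it uses the `-density` option to request 72 dpi rasterization, and the helper names `build_convert_command` and `convert_folder` are made up for illustration. A file with the problem described in this thread would still fail and be reported.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: Path, target: Path, dpi: int = 72) -> list:
    """Build the ImageMagick command line for one PDF (hypothetical helper)."""
    # -density placed before the input sets the resolution used to rasterize the PDF.
    return ["convert", "-density", str(dpi), str(source), str(target)]


def convert_folder(folder: Path, out_folder: Path, dpi: int = 72) -> None:
    """Convert every *.pdf in `folder`, writing same-named files to `out_folder`."""
    out_folder.mkdir(parents=True, exist_ok=True)
    for source in sorted(folder.glob("*.pdf")):
        target = out_folder / source.name
        result = subprocess.run(build_convert_command(source, target, dpi))
        if result.returncode != 0:
            # Report the failure and carry on with the remaining files.
            print(f"failed: {source.name}")
```

Calling `convert_folder(Path("pdfs"), Path("pdfs_as_images"))` would then process a whole folder in one go; both folder names are placeholders.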

Update

I tried to make a minimal working example (MWE) PDF to mimic the original PDF and found that the problem arises when a certain Arabic font called (AH) Manal Black is included. Running pdffonts from the command line on this MWE PDF gives:

Syntax Error (18062): Illegal character ')'
Syntax Error (18076): Dictionary key must be a name object
Syntax Error (18085): Dictionary key must be a name object
Syntax Error (18248): Illegal character ')'
Syntax Error (18248): Dictionary key must be a name object
Syntax Error (18253): Dictionary key must be a name object
Syntax Error (18599): Illegal character ')'
Syntax Error (18599): Dictionary key must be a name object
Syntax Error (18607): Dictionary key must be a name object
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
GAKHDJ+(AH                           CID TrueType      yes yes yes      5  0
HTCSVQ+Amiri-Regular                 CID TrueType      yes yes yes      7  0

If I exclude this Arabic font when compiling the document with the LaTeX/XeTeX engine, the convert command works just fine, as it does for the English PDFs. So the problem is most probably linked to how the font is parsed.

Update: A minimal working example is here: https://www.dropbox.com/s/qdeuzips0ivas4q/mwe_ar.pdf?dl=0

Recommended answer

The minimal working example has PDF object no. 10 as an ObjStm (object stream), in which this part can be found (I edited the whitespace formatting for readability):

<<  /Type               /Font
    /Subtype            /Type0
    /BaseFont           /GAKHDJ+#28AH)#20Manal#20Black
    /Encoding           /Identity-H
    /DescendantFonts    [4 0 R]
    /ToUnicode          12 0 R
>>

So the font name, (AH) Manal Black, has the blanks properly hex-escaped as #20 and the opening parenthesis ( as #28, but the closing parenthesis ) has not been hex-escaped as #29, as it should have been.
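To make the escaping rule concrete, here is a small sketch of how a PDF name could be written out correctly. It assumes the usual rule for PDF name objects: any byte outside the printable range `!` to `~`, any delimiter character, and `#` itself is written as `#` followed by two hex digits. The function name `escape_pdf_name` is made up for illustration and is not part of any PDF library.

```python
# Characters that delimit tokens in PDF syntax and so cannot appear raw in a name.
PDF_DELIMITERS = b"()<>[]{}/%"


def escape_pdf_name(name: str) -> str:
    """Return `name` as the body of a PDF name token (without the leading '/')."""
    out = []
    for byte in name.encode("utf-8"):
        is_regular = (
            0x21 <= byte <= 0x7E          # printable, non-whitespace ASCII
            and byte not in PDF_DELIMITERS
            and byte != 0x23              # '#' introduces an escape, so escape it too
        )
        out.append(chr(byte) if is_regular else f"#{byte:02X}")
    return "".join(out)


print(escape_pdf_name("GAKHDJ+(AH) Manal Black"))
# prints: GAKHDJ+#28AH#29#20Manal#20Black
```

Compared with the `/BaseFont` value in the file, the only difference is the `#29` where the file contains a raw `)`.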

Without knowing more about the PDF generation process, my guess is that the Creator/Producer combination given in the file's metadata,

Creator:    XeTeX output 2015.05.01:1207
Producer:   xdvipdfmx (20140317)

is to blame. This is a bug in the PDF-generating software...

Maybe I should reveal how I dissected and uncompressed the MWE PDF:


  1. Trying it with QPDF didn't work:

qpdf --qdf --object-streams=disable mwe_ar.pdf qdf.pdf

 object stream 10 (file position 585): unexpected )


  2. Trying it with pdftk didn't work either:

    pdftk mwe_ar.pdf cat pdftk.pdf uncompress
    
     Error: Unable to find file.
     Error: Failed to open PDF file: 
        mwe_ar.pdf
     Errors encountered.  No output created.
     Done.  Input errors, so no output created.
    


  3. Trying it with MuPDF's mutool failed as well:

    mutool clean -d mwe_ar.pdf mutool.pdf
    
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (1 0 R)
     warning: cannot load object (1 0 R) into cache
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (4 0 R)
     error: cannot load object (4 0 R) into cache
    


  4. Finally, as a last resort, PeePDF.py to the rescue:

    $ cat peepdf-commands.txt
    
     object 10
    
    $ peepdf.py -s peepdf-commands.txt mwe_ar.pdf
    
      << /Length 1000
      /N 13
      /Type /ObjStm
      /Filter /FlateDecode
      /First 84 >>
      stream
      9 0 3 72 11 133 2 197 1 312 15 343 4 446 14 625 19 876 6 1344 18 1514 5 1758 7 1886 <</Font<</F1 5 0 R/F2 7 0 R>>/ProcSet[/PDF/Text/ImageC/ImageB/ImageI]>>
      <</Resources 9 0 R/Type/Page/Parent 11 0 R/Contents[8 0 R]>>
      <</Type/Pages/Count 1/Kids[3 0 R]/MediaBox[0 0 595.28 841.89]>>
      <</Creator( XeTeX output 2015.05.01:1207)/Producer(xdvipdfmx \(20140317\))/CreationDate(D:20150501120749+01'00')>>
      <</Pages 11 0 R/Type/Catalog>>
      [417[251]421[257]424[368]443[470]445[355]450[380]480[322]498[480 233]505[461]508[256]514[326]520[264]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/FontDescriptor 14 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 199/W 15 0 R>>
      <</Type/FontDescriptor/Ascent 529/Descent -415/StemV 109/CapHeight 529/AvgWidth 392/FontBBox[-112 -321 1006 1137]/ItalicAngle 0/Flags 6/Style<</Panose<000000000000000000000000>>>/FontName/GAKHDJ+#28AH)#20Manal#20Black/FontFile2 16 0 R/CIDSet 17 0 R>>
      [39[693]41[522]51[535]108[415]124[415]388[218 926]402[1213]406[541]446[586]1886[317]1992[229]2016[366]2021[366]2105[244]2108[244]2139[1006]2150[295]2162[378]2227[379 452]2272[589]2294[176]2300[198]2308[389]2339[343]2356[723]2359[1079]2397[552]2413[346]2457[177]2491[299]2912[349]2952[219]2969[209]2973[148]2976[302]2981[341]3027[168]3149[550]3297[259]3325[292]3726[248]3732[319]3853[411]3893[179]4021[55]4323[104]4627[560]5068[238]5106[476]5322[159]5328[222]6366[93]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/HTCSVQ+Amiri-Regular/FontDescriptor 18 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 190/W 19 0 R>>
      <</Type/FontDescriptor/Ascent 1123/Descent -635/StemV 87/CapHeight 1123/AvgWidth 685/FontBBox[-581 -900 11467 1815]/ItalicAngle 0/Flags 6/Style<</Panose<000000000500000000000000>>>/FontName/HTCSVQ+Amiri-Regular/FontFile2 20 0 R/CIDSet 21 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/Encoding/Identity-H/DescendantFonts[4 0 R]/ToUnicode 12 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/HTCSVQ+Amiri-Regular/Encoding/Identity-H/DescendantFonts[6 0 R]/ToUnicode 13 0 R>>
    
      endstream
    


    The more often I use PeePDF.py, the more I love it. Thanks, Jose Miguel, for this wonderful tool!
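If PeePDF is not at hand, a rough first look inside the compressed streams can also be taken with Python's standard library alone. The sketch below is a heuristic, not a PDF parser: it assumes the streams are Flate-compressed and locates them by the `stream`/`endstream` keywords rather than by `/Length`. It flags any name token that runs straight into a `)`, so a literal string that happens to contain a slash (a URL, say) can produce a false positive. The function name `find_suspect_names` is made up for illustration.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords (ignores /Length).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A name token that is immediately followed by ')': a likely unescaped parenthesis.
SUSPECT_NAME_RE = re.compile(rb"/[^\s()<>\[\]{}/%]+\)")


def find_suspect_names(pdf_bytes: bytes) -> list:
    """Return name tokens from Flate-compressed streams that end in a raw ')'."""
    hits = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (an image, for instance): skip it
        hits.extend(SUSPECT_NAME_RE.findall(data))
    return hits
```

Reading the MWE with `open("mwe_ar.pdf", "rb").read()` and passing the bytes to `find_suspect_names` should, if the file is as shown above, list the `/GAKHDJ+#28AH)` tokens.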
