Handling (remapping) missing/problematic (CID/CJK) fonts in PDF with ghostscript?


Problem description


    In brief, I'm dealing with a problematic PDF, which:

    • Cannot be fully rendered in a document viewer like evince, because of missing font information;
    • However - ghostscript can fully render the same PDF.

    Thus -- regardless of what ghostscript uses to fill in the blanks (maybe fallback glyphs, or a different method to accessing fonts) -- I'd like to be able to use ghostscript to produce ("distill") an output PDF, where pretty much nothing will be changed, except font information added, so evince can render the same document in the same manner as ghostscript can.

    My question is thus: is this possible to do at all; and if so, what would the command line be to achieve something like that?

    Many thanks in advance for any answers,
    Cheers!


    Details:

    I'm actually on an older Ubuntu 10.04, and I might be experiencing not a bug but an installation problem with evince (lack of the poppler-data package), as noted in Ubuntu poppler bug #386008, "Some fonts fail to display due to 'Unknown font tag...'".

    However, that is exactly what I'd like to handle, so I'll use the fontspec.pdf attached to that bug report ("PDF triggering the bug.") to demonstrate the problem.

    evince

    First, I open this pdf's page 3 in evince; and evince complains:

    $ evince --page-label=3 fontspec.pdf
    
    Error: Missing language pack for 'Adobe-Japan1' mapping
    Error: Unknown font tag 'F5.1'
    Error (7597): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7630): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7660): No font in show
    Error: Unknown font tag 'F5.1'
    ...
    

    The rendering looks like this (screenshot not preserved):

    ... and it is obvious that some font shapes are missing.
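
One way to confirm the installation problem suspected above is to look for poppler's CMap data directly. The sketch below assumes that the poppler-data package installs its CMaps under /usr/share/poppler/cMap/<collection> (the usual location on Debian/Ubuntu, but an assumption for other systems). The function names are made up for illustration.

```shell
# Succeeds if a CMap directory for the given character collection exists.
# $1: collection name, e.g. Adobe-Japan1
# $2: base directory (assumed default: /usr/share/poppler/cMap)
has_poppler_cmap() {
    collection=$1
    base=${2:-/usr/share/poppler/cMap}
    [ -d "$base/$collection" ]
}

# Prints a one-line, human-readable verdict for the same arguments.
report_poppler_cmap() {
    if has_poppler_cmap "$@"; then
        echo "CMap data for $1 found"
    else
        echo "CMap data for $1 missing; installing poppler-data may help"
    fi
}
```

For example, `report_poppler_cmap Adobe-Japan1` would check for the collection named in evince's "Missing language pack" error.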

    Adobe acroread

    Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:

    $ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf
    

    ... generates no output to terminal whatsoever (for more on /a switch, see Man page acroread) -- and the program has absolutely no problem displaying the fonts.

    Also, while I'd like to avoid the roundtrip to postscript, note that acroread itself can be used to convert a PDF to postscript:

    $ ./Adobe/Reader9/bin/acroread -v
    9.5.1
    
    $ ./Adobe/Reader9/bin/acroread -toPostScript \
    -rotateAndCenter -choosePaperByPDFPageSize \
    -start 3 -end 3 \
    -level3 -transQuality 5 \
    -optimizeForSpeed -saveVM \
    fontspec.pdf ./ 
    

    Again, the above command line will generate no output to terminal; -optimizeForSpeed -saveVM are there because apparently they deal with fonts; the last argument ./ is the output directory (output file is automatically called fontspec.ps).

    Now, evince can display the previously missing fonts in the fontspec.ps output - but again complains:

    $ evince fontspec.ps 
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    ...
    

    ... and furthermore, all text seems to be flattened to curves in the postscript - so now one cannot select the text in the .ps file in evince anymore (note that the .ps file cannot be opened in acroread). However, one can convert this .ps back into .pdf again:

    $ pstopdf fontspec.ps   # note, `pstopdf` has no output filename option;
                            # it will automatically choose 'fontspec.pdf',
                            # and overwrite previous 'fontspec.pdf' in 
                            # the same directory 
    

    ... and now the text in the output of pstopdf is selectable in evince, all fonts are there, and evince doesn't complain anymore. However, as I noted, I'd like to avoid roundtrip to postscript files altogether.

    display (from imagemagick)

    We can also observe the page in the same document with imagemagick's display (note that image panning from the command line using 'display' is apparently still not available, so I've used -crop below to adjust the viewport):

    $ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
       **** Warning: considering '0000000000 00000 n' as a free entry.
    ...
       **** This file had errors that were repaired or ignored.
       **** The file was produced by: 
       **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
       **** Please notify the author of the software that produced this
       **** file that it does not conform to Adobe's published PDF
       **** specification.
    

    ... which generates some ghostscript-ish warnings, and results in something like this (screenshot not preserved):

    ... where it's obvious that the missing fonts that evince couldn't render are now shown properly by imagemagick's display.

    ghostscript

    Finally, we can use ghostscript itself as an x11 viewer, to observe the same page of the same document:

    $ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
    -c '<</PageOffset [-120 520]>> setpagedevice' \
    -f fontspec.pdf
    
    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 74.
    Page 3
    >>showpage, press <return> to continue<<
    ^C
    

    ... and results in this output (screenshot not preserved):

     

    In conclusion: ghostscript (and apparently by extension, imagemagick) can seemingly find the missing font (or at least some replacement for it), and render a page with that -- even if evince fails at that for the same document.

    I would, therefore, simply like to export a PDF version from ghostscript, that would have only the missing fonts embedded, and no other processing; so I try this:

    $ gs -dBATCH -dNOPAUSE -dSAFER  \
    -dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
    -dAutoFilterMonoImages=false \
    -dAutoFilterGrayImages=false \
    -dAutoFilterColorImages=false \
    -dDownsampleColorImages=false \
    -dDownsampleGrayImages=false \
    -dDownsampleMonoImages=false \
    -sDEVICE=pdfwrite \
    -dFirstPage=3 -dLastPage=3 \
    -sOutputFile=mypg3out.pdf -f fontspec.pdf
    
    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 3.
    Page 3
    
       **** This file had errors that were repaired or ignored.
       **** The file was produced by:
       **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
       **** Please notify the author of the software that produced this
       **** file that it does not conform to Adobe's published PDF
       **** specification.
    

    ... but it doesn't work - the output file mypg3out.pdf suffers from the exact same problems in evince as noted previously.

    Note: While I'd like to avoid the postscript roundtrip, a good example of a gs command line for converting from pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command line switches, used for .pdf to .pdf, do not seem to have any effect on the problem described above.

    Solution

    OK, point 1: you CANNOT use Ghostscript and pdfwrite to create a PDF file 'without any additional processing'.

    The way that pdfwrite and Ghostscript work is to fully interpret the incoming data (PostScript, PDF, XPS, PCL, whatever), creating a series of graphics primitives, which are passed to the pdfwrite device. The pdfwrite device then reassembles these into a brand new PDF file.

    So it's not possible to take a PDF file as input and manipulate it; pdfwrite will always create a new file.

    Now, I would suggest that you upgrade your 9.02 Ghostscript to 9.05 to start with. Missing CIDFonts are much better handled in 9.05 (and will be further improved in 9.06 later this year). (The font you are missing, 'Osaka Mono', is in fact a CIDFont, not a regular font.)

    Using the current bleeding edge Ghostscript code produces a PDF file for me which has the missing font embedded. I can't tell if this will work for you because my copy of evince renders the original file perfectly well.
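
To see whether the installed Ghostscript already meets the suggested 9.05 minimum, a small version check can help. This is a minimal sketch: `version_ge` and `check_gs_version` are hypothetical helper names, and the check relies on `gs --version` printing just the version number.

```shell
# Succeeds if dotted numeric version $1 is at least version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components compare as 0; "02" compares as 2.
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Reports whether the gs on PATH is at least the required version
# (default 9.05, the version suggested in the answer).
check_gs_version() {
    required=${1:-9.05}
    if ! command -v gs >/dev/null 2>&1; then
        echo "gs not found on PATH"
        return 2
    fi
    installed=$(gs --version)
    if version_ge "$installed" "$required"; then
        echo "Ghostscript $installed is at least $required"
    else
        echo "Ghostscript $installed is older than $required; consider upgrading"
        return 1
    fi
}
```

Running `check_gs_version` with no argument compares against 9.05; a different minimum can be passed as the first argument.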

    Added later

    Examining the original PDF file I see that the fonts there are indeed embedded (as I would expect, since they are subsets). So in fact as you say in your own answer above, the problem is not font embedding, but the use of CIDFonts.
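
The embedding status and font type can usually be inspected with `pdffonts` from poppler-utils, which prints one row per font with `type`, `emb` and `sub` columns. The filter below is only a sketch: `list_cid_fonts` is a hypothetical name, and it assumes the usual `pdffonts` layout of two header lines followed by rows whose type reads "CID Type 0", "CID Type 0C" or "CID TrueType". Since `pdffonts` is built on the same library as evince, it may print similar warnings for this file.

```shell
# Reads pdffonts output on stdin and keeps only the rows describing CID fonts.
list_cid_fonts() {
    # NR > 2 skips the column header and the dashed separator line.
    awk 'NR > 2 && / CID (Type|TrueType)/'
}
```

A possible invocation for page 3 only would be `pdffonts -f 3 -l 3 fontspec.pdf | list_cid_fonts`.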

    My answer here will not help you, as pdfwrite will still produce a CIDFont in the output. Basically this is a flaw in your version or installation of evince.

    The problem with 'remapping' the characters is that a font is limited to 256 glyphs, while a CIDFont has effectively no limit. So there is no way to put a CIDFont into a Font. The only way to do this would be to create multiple Fonts, each of which contains a portion of the original, and then switch between them as required. Slow and clunky.
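
The arithmetic behind the 256-glyph limit can be sketched as follows. This is a naive fixed partition chosen purely for illustration (glyph index divided by 256), not necessarily the scheme that Ghostscript or any other tool uses; both function names are hypothetical.

```shell
# Number of 256-glyph fonts needed to hold $1 glyphs: ceil(n / 256).
fonts_needed() {
    echo $(( ($1 + 255) / 256 ))
}

# Maps a glyph index (CID) to "font_index char_code" under the naive partition:
# font_index selects which of the N fonts to switch to, char_code is the
# single-byte code within that font.
split_cid() {
    echo "$(( $1 / 256 )) $(( $1 % 256 ))"
}
```

Under this scheme a subset of 1000 glyphs would need `fonts_needed 1000` = 4 regular fonts, and every change of font index in the text would require a font switch in the page content.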

    If you convert to PostScript using the ps2write device then it will do this for you, but you stand a great risk that in the process it will convert the vector glyph data into bitmaps, which will not scale well.
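
A conversion with the ps2write device might look like the sketch below. The `GS` variable and the `pdf_to_ps` name are conveniences introduced here, not part of Ghostscript; `-o` is Ghostscript's shorthand that sets the output file and implies `-dBATCH -dNOPAUSE`. Whether glyphs survive as outlines or end up as bitmaps depends on the input file and the Ghostscript version, so the result is worth inspecting at high zoom.

```shell
# Path or name of the Ghostscript executable (assumed to be "gs" by default).
GS=${GS:-gs}

# Converts a PDF to PostScript with the ps2write device.
# $1: input PDF, $2: output PostScript file
pdf_to_ps() {
    "$GS" -q -dSAFER -sDEVICE=ps2write -o "$2" "$1"
}
```

For the file discussed here this would be called as `pdf_to_ps fontspec.pdf fontspec.ps`.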

    Fundamentally you can't really achieve what you want to do (convert 1 CIDFont into N regular Fonts) with Ghostscript, or in fact with any other tool that I know of. While it's technically possible, there is no real point since all PDF consumers should be able to handle CIDFonts. If they can't, then it's a bug in the PDF consumer.
