PDF font encoding -- why can't I copy text from a PDF?


Problem description

    After converting a PDF file, I cannot copy the text from it any more.

    All I get are unreadable characters which make no sense whatsoever.

    The font is French Script MT, but the encoding is custom (see result of Adobe reader > File > Document properties > Font).

    Here is the PDF.

    I tried several methods... editing in PDF editor; notepad++; Word; Acrobat Pro.

    • Is there anything wrong with the PDF file's source code which prevents the correct copying of text elements?

    • Can this PDF's source code possibly be changed/modified/amended so that copy+pasting text would work?

    Solution

    I've looked at your file using different tools:

    • qpdf (by Jay Berkenbilt) to analyze the file.
    • pdfid.py and pdf-parser.py (by Didier Stevens) to analyze more.
    • PDFlib's TET (text extraction tool) to try and extract text.
    • PDFlib's FontReporter plugin for Acrobat and Adobe Reader, to generate a table of the glyphs used by the PDF.
    • Poppler's pdffonts command line utility.

    Even TET failed to extract the text. And TET is the best I know for this task -- it often succeeds where other methods fail.

    My analysis gave me the following results:

    1. pdffonts gives a first quick overview. It returns the following info:

      $ pdffonts "so#12703387-problem.pdf" 
      
         name                      type         encoding    emb sub uni object ID
         ------------------------- ------------ ----------- --- --- --- ---------
         YLWHHJ+FrenchScriptMT     Type 1C      Custom      yes yes no      14  0
      

      The column uni should contain a yes entry. The no in that column indicates that a /ToUnicode table is missing in the font used by the PDF. That font is embedded as a subset under the name YLWHHJ+FrenchScriptMT. It also uses a Custom font encoding (most likely using a /Differences array). Without a correct and complete /ToUnicode table it will be impossible to extract the text correctly.
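To check several files for this symptom, the pdffonts table can be parsed mechanically. The following is a minimal Python sketch, not part of the original analysis. It assumes the column layout shown above, with every field sitting under its run of dashes; very long font names may break that assumption. The names `parse_pdffonts` and `fonts_without_tounicode` are made up for illustration.

```python
import re

# Sample output copied from the pdffonts run above.
SAMPLE = "\n".join([
    "name                      type         encoding    emb sub uni object ID",
    "------------------------- ------------ ----------- --- --- --- ---------",
    "YLWHHJ+FrenchScriptMT     Type 1C      Custom      yes yes no      14  0",
])

COLUMNS = ["name", "type", "encoding", "emb", "sub", "uni", "object_id"]


def parse_pdffonts(output):
    """Split pdffonts output into one dict per font.

    Assumption: the runs of dashes in the ruler line mark the column
    boundaries. Splitting on whitespace would not work, because values
    such as "Type 1C" contain spaces.
    """
    lines = [ln.rstrip() for ln in output.splitlines() if ln.strip()]
    ruler_idx = next(i for i, ln in enumerate(lines) if set(ln) <= {"-", " "})
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    # Let the last column run to the end of the line.
    spans[-1] = (spans[-1][0], None)
    fonts = []
    for row in lines[ruler_idx + 1:]:
        fields = [row[start:end].strip() for start, end in spans]
        fonts.append(dict(zip(COLUMNS, fields)))
    return fonts


def fonts_without_tounicode(fonts):
    """Return the names of fonts whose 'uni' column is not 'yes'."""
    return [font["name"] for font in fonts if font["uni"] != "yes"]


fonts = parse_pdffonts(SAMPLE)
print(fonts_without_tounicode(fonts))
```

Any font name this prints is a candidate for the copy-and-paste problem described here.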

    2. The PDF was generated by PDFCreator version 1.0.2, which is based on the very old Ghostscript 8.70. (Running "pdfinfo so#12703387-problem.pdf" reveals this.)

    3. The font used is a subset of FrenchScriptMT, containing 94 different glyphs.

    4. The font encoding is "Custom", using a /Differences array.

    5. The text drawing in the PDF predominantly uses the operator TJ, which allows individual glyph positioning.

    6. All text drawing operations make extensive use of the 'individual glyph positioning' feature. Nearly all glyphs are positioned individually, as you can see from this code snippet (first occurrence of TJ):

      [<01>-3.18894<02>3.62397<02>3.62397<03>-2.42535<04>3.12889<05>3.88047<06>
      -14.1669<07>-3.7221<02>-4.37556<04>3.62397(\b)-4.88286(\t)3.88047<01>
      -3.18874(\n)1.29105<06>-13.6718(\b)-4.88245<0b>1.78573<02>3.1293<06>
      -21.6714<04>3.62438(\f)0.553714(\r)0.0464142<0e>-1.28494<0f>-0.448671<10>
      3.88007<06>-21.6714(\b)-4.88245<0b>1.78573<02>3.1293<06>-13.6718<11>
      0.0920142<02>-4.37515<04>3.62438(\b)2.622<06>-13.6718<03>
      -10.4245(\t)3.88007<11>0.0920142<02>3.62438<12>-6.14134(\b)3.11708<13>
      3.3858<14>0.0455999<15>-7.42628<06>-14.1669<16>2.90048(\r)0.0455999<17>
      -1.88425(\r)0.0455999<0b>1.78654(\r)]TJ
      

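    7. As can be seen from '6.', the text drawing operations mostly do not use 'a sequence of literal characters enclosed in parentheses ( )', but 'hexadecimal data enclosed in angle brackets < >' (see PDF spec, chapter 7.3.4.1). The few literal strings that do appear, such as (\b) and (\t), are escape sequences for control-character codes, not readable letters.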
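To make the two string forms concrete, here is a small Python sketch that splits a TJ array into character codes and positioning adjustments. It is an illustration only, not a full PDF tokenizer: it assumes a simple font with one byte per character code, ignores nested parentheses and backslash line continuations, and the helper names (`split_tj`, `decode_hex`, `decode_literal`) are hypothetical. The input is a contiguous excerpt from the array quoted in point 6.

```python
import re

# One token per match: hex string, literal string, or number.
TOKEN = re.compile(
    r"<(?P<hex>[0-9A-Fa-f\s]*)>"
    r"|\((?P<lit>(?:\\.|[^\\()])*)\)"
    r"|(?P<num>[-+]?(?:\d+\.?\d*|\.\d+))",
    re.S,
)

# Single-character escapes allowed in PDF literal strings.
ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
           "(": 40, ")": 41, "\\": 92}


def decode_hex(body):
    """Decode the inside of <...>; an odd final digit is padded with 0."""
    digits = re.sub(r"\s+", "", body)
    if len(digits) % 2:
        digits += "0"
    return list(bytes.fromhex(digits))


def decode_literal(body):
    """Decode the inside of (...) into a list of character codes."""
    codes, i = [], 0
    while i < len(body):
        if body[i] != "\\":
            codes.append(ord(body[i]))
            i += 1
            continue
        octal = re.match(r"[0-7]{1,3}", body[i + 1:])
        if octal:  # \ddd octal escape
            codes.append(int(octal.group(), 8) & 0xFF)
            i += 1 + len(octal.group())
        else:      # \n, \b, ... or an unknown escape (backslash dropped)
            nxt = body[i + 1]
            codes.append(ESCAPES.get(nxt, ord(nxt)))
            i += 2
    return codes


def split_tj(array_src):
    """Return (character codes, positioning adjustments) of a TJ array."""
    codes, adjustments = [], []
    for m in TOKEN.finditer(array_src):
        if m.group("hex") is not None:
            codes.extend(decode_hex(m.group("hex")))
        elif m.group("lit") is not None:
            codes.extend(decode_literal(m.group("lit")))
        else:
            adjustments.append(float(m.group("num")))
    return codes, adjustments


EXCERPT = r"[<04>3.62397(\b)-4.88286(\t)3.88047<01>-3.18874(\n)1.29105<06>]TJ"
codes, adjustments = split_tj(EXCERPT)
print(codes)        # [4, 8, 9, 1, 10, 6]
print(adjustments)
```

Both notations end up as small integer codes (here 4, 8, 9, 1, 10, 6), which is why the pasted text consists of control characters instead of letters.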

    8. The hex character codes do not map easily to character names (though the encoding is supposed to be derived from WinAnsiEncoding).

      One has to lookup the custom encoding table for it first.

      I used the command pdf-parser.py -s encoding so#12703387-problem.pdf for this. Result:

      <<
        /Type /Encoding
        /BaseEncoding /WinAnsiEncoding
        /Differences [
          1
          /g81 /g72
          /g71 /g86
          /g30 /g3
          /g53 /g87
          /g76 /g74
          (... skipping some lines of output ...)
          /g32 /g170
          /g105 /g103
          /g95 ]
      >>
      

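    9. That last point exposes the crux of the matter: the font's encoding table does not use standard character names. Instead, starting at character code 1, it uses /g81, /g72, ... /g95 (94 different names altogether). Standard names such as /A or /eacute would let an extraction tool infer the Unicode value; opaque names like these give it nothing to go on.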
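How a /Differences array assigns names to codes can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `expand_differences` is hypothetical, and the input contains only the entries visible in the pdf-parser.py output above (the skipped lines stay skipped).

```python
import re


def expand_differences(src):
    """Map character codes to glyph names from a /Differences array.

    An integer sets the next character code; every glyph name that
    follows is assigned to the current code, which then increases by one.
    """
    mapping = {}
    code = None
    for token in re.findall(r"/[^\s/\[\]]+|\d+", src):
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name before any starting code")
            mapping[code] = token[1:]
            code += 1
        else:
            code = int(token)
    return mapping


# Only the entries shown in the output above; the rest was not printed.
VISIBLE_PART = "[ 1 /g81 /g72 /g71 /g86 /g30 /g3 /g53 /g87 /g76 /g74 ]"

mapping = expand_differences(VISIBLE_PART)
for code, name in sorted(mapping.items()):
    print("<%02x> -> /%s" % (code, name))

# Names of the form g<number> carry no hint about the character they draw.
all_opaque = all(re.fullmatch(r"g\d+", name) for name in mapping.values())
print(all_opaque)
```

So the code `<01>` from the TJ array selects the glyph named /g81, `<02>` selects /g72, and so on; nothing in these names says which letter the glyph represents.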

    10. My last point about the glyph names is also confirmed by the output of PDFlib's FontReporter plugin.

    11. No automatic text extraction tool (none that I know of, at least) could make heads or tails of this mess. A human expert could, but I didn't even try (because it wouldn't help you much -- see my summary below for better help).

    12. In general, the best way to automate text extraction for this type of font encoding is to use OCR (optical character recognition) software. However, in this case the typeface used ('French Script MT') will not work well with OCR.

    13. In theory it should be possible to add PDF code to the existing PDF file which in effect adds the missing /ToUnicode table. I'm not aware of any tool which could do this automatically. Adding it would involve:

      • reverse-engineering the PDF file, then
      • writing the table by hand, then
      • inserting it at the correct spot of the PDF file as a separate PDF object, then
      • inserting an entry pointing to that table into the font object's dictionary, and lastly
      • updating the PDF file's xref table with the correct byte offsets of all objects affected by the changes.
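The "writing the table by hand" step can be partly mechanized once the code-to-character mapping is known. The Python sketch below only generates the text of a /ToUnicode CMap for a single-byte font; it does not touch the PDF itself. The function name `build_tounicode_cmap` is hypothetical, and the example mapping is invented purely for illustration: the real mapping for this file would still have to be worked out by a human looking at the glyphs.

```python
def build_tounicode_cmap(code_to_text):
    """Return the text of a ToUnicode CMap for single-byte character codes.

    code_to_text maps a character code (0-255) to the Unicode string
    that should be produced when that code is copied or extracted.
    """
    entries = [
        "<%02X> <%s>" % (code, text.encode("utf-16-be").hex().upper())
        for code, text in sorted(code_to_text.items())
    ]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # A single bfchar block may hold at most 100 entries.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append("%d beginbfchar" % len(chunk))
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# INVENTED mapping for illustration only; NOT the real mapping of this PDF.
HYPOTHETICAL_MAPPING = {1: "B", 2: "o", 3: "n"}

cmap_text = build_tounicode_cmap(HYPOTHETICAL_MAPPING)
print(cmap_text)
```

The generated text would still have to be stored as a stream object, referenced from the font dictionary through a /ToUnicode entry, and followed by the xref update described in the last step.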

    Summary -- my advice to you:

    1. Re-create your PDF.
    2. If possible, use a PDFCreator release that is based on a more recent version of Ghostscript.
    3. Change the setting of PDFCreator so that it doesn't create a font subset any more. Make sure the original font is fully embedded.

    Then very likely, the font encoding problem will go away and you'll be able to copy'n'paste text from your PDF.


    Update:

    I created 5 hand-coded sample PDF files which expose the problems caused by a missing or an incorrect/manipulated /ToUnicode table in a PDF. These samples are now committed to our recently created GitHub repository, which is devoted to providing sample PDF files for studying, learning and exploring the PDF syntax by looking at their source code. The 5 files are in the sub-directory 'textextract'.
