PDF doc text shows differently in IE / Firefox / Chrome


Problem description


      I am trying to produce a PDF file with Hebrew text.

      I managed to produce a simple file. The file is here

      The file opens in Adobe Acrobat Reader perfectly, showing the string "אאא ווו תתת". It also opens perfectly in IE.

      The problem is that other viewers show it badly: Google Chrome / Google Docs show it without any of the "ו" occurrences (that is, the three "ו" letters disappear!)

      Mozilla Firefox shows it very badly, displaying some letters many times and in odd places on the page...

      What am I doing wrong? What is wrong with the file?

      A link to the file is here

      I know this is a tough question.

      Any help will be appreciated...

      Solution

      A very short and simplified introduction

      Fonts in PDF are PDF objects: Font dictionaries containing the numerous parameters and sub-dictionaries needed to select glyphs, show them, and translate character codes to a logical (Unicode) representation for content extraction. Fonts in layman's terms (what we see as *.ttf or *.pfb files) are called font programs. They are either embedded or external, and are referred to by one of the sub-dictionaries of the Font object.

      Fonts are divided into two groups:

      • Simple fonts (Type1, Type3 or TrueType), in which glyphs are selected by single-byte character codes obtained from a string that is shown by the text-showing operators. The mapping from codes to glyphs is called the font's encoding. It can be built into the font program, defined by the Font object (by predefined name or explicitly) or, under special circumstances, constructed by the viewer application according to defined rules.

      The file in question doesn't contain simple fonts, and we won't discuss them any further. Note, though, that this over-simplified description doesn't even begin to reflect the real-life complexity.

      • Composite fonts (Type0), used to show text in which character codes can have variable length (up to 4 bytes) and which are therefore not restricted to 256 code points. A Type0 font always has one descendant, a font-like object called a CIDFont, and, similar to the encoding of simple fonts, a CMap object that maps character codes to character selectors, which in PDF are always CIDs (integers from 0 to 65535).

      Now, the character selector (CID) is not, in general, used directly to select glyphs from the font program. For a CIDFont of type CIDFontType2, the dictionary contains a CIDToGIDMap entry that maps CIDs to glyph identifiers. Those GIDs are, at last, used to select glyphs from the embedded font program (which, for a CIDFontType2 font, is a TrueType font program; do not confuse it with a Font object whose Subtype is TrueType).

      A Font object can have a ToUnicode resource that maps character codes to Unicode values for indexing, searching and extraction. It's called a ToUnicode CMap (as it follows similar syntax), but it should not be confused with the CMap object mentioned above.

      In what I call the simple case (and, I think, the sensible choice), the CMap is the predefined name Identity-H and the CIDToGIDMap is the predefined name Identity. Character codes extracted from a string (the argument to a text-showing operator) are then always 2-byte numbers that, effectively, select glyphs directly from the embedded TrueType program. In my experience this is the most common scenario and, as it appears, the one against which common software is tested.
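To make the simple case concrete, here is a minimal Python sketch (not taken from any PDF library) of what Identity-H plus an Identity CIDToGIDMap amounts to: every two bytes of the shown string form one big-endian number, and that number is used as code, CID and GID alike. The function names are made up for illustration, and the sample GIDs are borrowed from the David-Bold table further down.

```python
def decode_identity_h(data: bytes) -> list[int]:
    """Split a shown string into fixed 2-byte, big-endian character codes."""
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even byte length")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def glyph_ids(data: bytes) -> list[int]:
    # Identity-H: CID == character code.
    # Identity CIDToGIDMap: GID == CID.
    return decode_identity_h(data)


# Hypothetical string that selects glyphs 180, 159 and 154 directly.
shown = bytes.fromhex("00b4009f009a")
print(glyph_ids(shown))  # [180, 159, 154]
```

With this setup there is no per-font table for a viewer to get wrong, which may be part of why it is handled so reliably.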

      But that's not the case with the file in question.

      (The end of a short and simplified introduction)

      In our file, the text-showing operator effectively gets this string:

      0x000a 0x000a 0x000a 0x20 0x0020 0x0020 0x0020 0x20 0x0025 0x0025 0x0025 
      

      Of course there are no 'groups' in the string itself. I grouped the bytes myself, based on the CMap, which contains 2 codespace ranges:

      <20> <20>
      <0000> <19FF>
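The grouping can be reproduced mechanically. Below is a minimal Python sketch, assuming a simplified matching rule (a candidate of n bytes matches an n-byte range when every byte lies between the corresponding low and high bytes). The names `in_range` and `split_codes` are made up for illustration, and a real CMap parser handles more cases than this.

```python
# The two codespace ranges from the file's CMap, as (low, high) byte strings.
CODESPACE = [
    (bytes.fromhex("20"), bytes.fromhex("20")),
    (bytes.fromhex("0000"), bytes.fromhex("19FF")),
]


def in_range(chunk: bytes, lo: bytes, hi: bytes) -> bool:
    """True if chunk has the range's length and every byte is within bounds."""
    return len(chunk) == len(lo) and all(
        l <= b <= h for b, l, h in zip(chunk, lo, hi)
    )


def split_codes(data: bytes, ranges=CODESPACE) -> list[bytes]:
    """Split a shown string into variable-length character codes."""
    codes, pos = [], 0
    while pos < len(data):
        for lo, hi in sorted(ranges, key=lambda r: len(r[0])):
            chunk = data[pos:pos + len(lo)]
            if in_range(chunk, lo, hi):
                codes.append(chunk)
                pos += len(lo)
                break
        else:
            raise ValueError(f"no codespace range matches at offset {pos}")
    return codes


shown = bytes.fromhex("000a" * 3 + "20" + "0020" * 3 + "20" + "0025" * 3)
print(" ".join(code.hex() for code in split_codes(shown)))
# 000a 000a 000a 20 0020 0020 0020 20 0025 0025 0025
```

Under this rule the two ranges do not overlap (a first byte of 0x20 can only start a one-byte code), so the split is unambiguous for a reader that honours the codespace ranges.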
      

      To make a long story short: if we look up the character codes in the CMap to get CIDs, then look up the CIDs in the CIDToGIDMap to get GIDs, then look up the GIDs in the embedded David-Bold font to get Unicode values, we arrive at this table:

      Code        CID     GID     Unicode     Name
      
      0x000a      10      180     05EA        tav
      0x0020      32      159     05D5        vav
      0x0025      37      154     05D0        alef
      0x20        228     03      0020        space
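The same chain of lookups can be written out as a small Python sketch. The three dictionaries contain only the four rows of the table above (the real CMap, CIDToGIDMap and font cover far more entries), and the names `CMAP`, `CID_TO_GID`, `GID_TO_UNICODE` and `resolve` are made up for illustration.

```python
# Character code (as hex digits, so "20" and "0020" stay distinct) -> CID.
CMAP = {"000a": 10, "0020": 32, "0025": 37, "20": 228}

# CID -> glyph identifier in the embedded TrueType program.
CID_TO_GID = {10: 180, 32: 159, 37: 154, 228: 3}

# GID -> Unicode value, as found in the embedded David-Bold font.
GID_TO_UNICODE = {180: 0x05EA, 159: 0x05D5, 154: 0x05D0, 3: 0x0020}


def resolve(code_hex: str) -> tuple[int, int, int]:
    """Follow code -> CID -> GID -> Unicode for one character code."""
    cid = CMAP[code_hex]
    gid = CID_TO_GID[cid]
    return cid, gid, GID_TO_UNICODE[gid]


for code in ("000a", "0020", "0025", "20"):
    cid, gid, uni = resolve(code)
    print(f"0x{code}  CID={cid}  GID={gid}  U+{uni:04X}")
```

Note that the one-byte code 0x20 and the two-byte code 0x0020 are different keys: the first resolves to a space, the second to vav.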
      

      Now we have enough information to speculate about what confuses the viewer applications.


      In my first attempt, I suggested that the culprit is code 32 (and CID 32) being used for a non-space character (see comment above). This assumption was based on a case from several years ago, in which an older version of Acrobat didn't show a character with code 0x20 when it was at the end of a string. Acrobat assumed it to be a space when in fact, according to the encoding vector (of a simple font), it was another character.

      I changed this:

      • 0x0020 to 0x0004 in the content stream;
      • bytes 08 and 09 in the CIDToGIDMap to GID=159;
      • the value for CID=4 in the Widths array to the 'vav' width;
      • the ToUnicode CMap was adjusted accordingly.
      • (+ later I tried to remove the <0020> 32 line from the CMap; this is not reflected in the file linked in the comment)
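Bytes 08 and 09 are the ones to change because a CIDToGIDMap stream stores one 2-byte, big-endian GID per CID, so the entry for CID 4 sits at offset 2 × 4 = 8. A minimal Python sketch of that arithmetic follows, assuming the stream has already been decoded (uncompressed) and using a zero-filled 16-entry table as a stand-in for the real one; `get_gid` and `set_gid` are hypothetical helper names.

```python
def get_gid(stream: bytes, cid: int) -> int:
    """Read the GID stored for a CID (2 bytes per entry, high byte first)."""
    return int.from_bytes(stream[2 * cid:2 * cid + 2], "big")


def set_gid(stream: bytearray, cid: int, gid: int) -> None:
    """Overwrite the 2-byte entry for a CID in place."""
    stream[2 * cid:2 * cid + 2] = gid.to_bytes(2, "big")


# Stand-in table: 16 CIDs, all initially mapped to GID 0.
table = bytearray(2 * 16)
set_gid(table, 4, 159)       # CID 4 now selects the 'vav' glyph
print(table[8:10].hex())     # 009f
print(get_gid(table, 4))     # 159
```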

      Well, it did help, but unfortunately some viewers still refused to comply with the specification.


      Then I thought that maybe the variable character code width was the issue.

      I returned to the original file and changed this:

      • 0x20 to 0x00e4 in the content stream;
      • <20> 228 to <00e4> 228 in the CMap;
      • codespacerange <20> <20> in the CMap deleted;
      • codespacerange <20> <20> in the ToUnicode CMap deleted.
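The content-stream part of this change amounts to rewriting every one-byte code as a two-byte code. A minimal Python sketch is shown below; it works on the already-split list of codes from the string shown earlier, and the name `widen_codes` is made up for illustration. The CMap and ToUnicode edits from the list above are not modelled here.

```python
def widen_codes(codes: list[bytes], remap: dict[bytes, bytes]) -> bytes:
    """Rebuild the shown string, replacing codes listed in remap."""
    return b"".join(remap.get(code, code) for code in codes)


# The original codes: three tav, space, three vav, space, three alef.
original = [
    bytes.fromhex(h)
    for h in ["000a"] * 3 + ["20"] + ["0020"] * 3 + ["20"] + ["0025"] * 3
]

# Replace the single-byte space code 0x20 with the two-byte code 0x00e4.
fixed = widen_codes(original, {bytes.fromhex("20"): bytes.fromhex("00e4")})
print(fixed.hex())  # 000a000a000a00e400200020002000e4002500250025
```

After the change every code is exactly two bytes long, and since 0x00e4 is 228 in decimal, the new code for the space happens to equal its CID.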

      This file appears to open perfectly in all the viewers mentioned in the original question and in the comments below. Miraculously, the 0x0020 code and CID 32 do not interfere.


      The conclusion, I think, can be this:

      Given the current state of affairs, PDF creators are NOT advised to mix single-byte and double-byte codes in a font encoding (CMap).
