PDF doc text shows differently in IE / Firefox / Chrome
I am trying to produce a PDF file with Hebrew text.
I managed to produce a simple file. The file is here
The file opens perfectly in Adobe Acrobat Reader, showing the string "אאא ווו תתת". It also opens perfectly in IE.
The problem is that other viewers show it badly: Google Chrome / Google Docs show it without any of the "ו" occurrences (that is, the three letters "ו" disappear!)
Mozilla Firefox shows it very badly, showing some letters many times and in odd places on the page...
What am I doing wrong? What is wrong with the file?
I know this is a tough question.
Any help will be appreciated...
A very short and simplified introduction
Fonts in PDF are PDF objects: Font dictionaries, containing numerous parameters and sub-dictionaries necessary to select glyphs, show them, and translate character codes to a logical (Unicode) representation for content extraction. Fonts in layman's terms (as we see them, as *.ttf or *.pfb files) are called font programs, either embedded or external, and are referred to by one of the sub-dictionaries of the Font object.
Fonts are divided into two groups:
- Simple fonts (Type1, Type3 or TrueType), in which glyphs are selected by single-byte character codes obtained from a string that is shown by the text-showing operators. The mapping from codes to glyphs is called the font's encoding; it can be built into the font program, defined by the Font object (by predefined name or explicitly) or, under special circumstances, constructed by the viewer application according to defined rules.
The file in question doesn't contain simple fonts, and we won't discuss them any further. Note, though, that this over-simplified description doesn't even begin to reflect the real-life complexity.
- Composite fonts (Type0), used to show text in which character codes can have variable length (up to 4 bytes), and which therefore aren't restricted to 256 code points. A Type0 font always has one descendant, a font-like object called a CIDFont, and, similar to the encoding of simple fonts, a CMap object that maps character codes to character selectors, which in PDF are always CIDs: integers up to 65535.
Now, the character selector (CID) is not, in general, used directly to select glyphs from the font program. For a CIDFont of type CIDFontType2, the dictionary contains a CIDToGIDMap entry that maps CIDs to glyph identifiers. Those GIDs are, at last, used to select glyphs from the embedded font program, which for a CIDFontType2 font is a TrueType font program (not to be confused with a Font object whose Subtype is TrueType).
A Font object can have a ToUnicode resource that maps CIDs to Unicode values for indexing, searching and extraction. It's called a ToUnicode CMap (as it follows similar syntax), but it should not be confused with the CMap object mentioned above.
In what I call the simple case (and, I think, the sensible choice), the CMap is the predefined name Identity-H and the CIDToGIDMap is the predefined name Identity. Character codes extracted from a string (the argument to a text-showing operator) are therefore always 2-byte numbers that, effectively, select glyphs directly from the embedded TrueType program. In my experience this is the most common scenario and, as it appears, the case against which common software is tested.
But that is not the case with the file in question.
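As a minimal sketch of that simple case (not taken from the file in question), the decoding reduces to reading fixed 2-byte big-endian numbers. The function name and the sample bytes are made up for illustration; the three values happen to be the GIDs that appear later in this answer.

```python
def decode_identity_h(data: bytes) -> list[int]:
    """Split a shown string into 2-byte big-endian codes.

    With an Identity-H CMap each code is the CID, and with an
    Identity CIDToGIDMap each CID is also the GID.
    """
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Hypothetical string: three 2-byte codes that select glyphs directly.
shown = bytes.fromhex("009a009f00b4")
gids = decode_identity_h(shown)
print(gids)
```

No codespace ranges need to be consulted here, which is presumably why this path is the best exercised one in viewers.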
(The end of a short and simplified introduction)
In our file, the text-showing operator effectively gets this string:
0x000a 0x000a 0x000a 0x20 0x0020 0x0020 0x0020 0x20 0x0025 0x0025 0x0025
Of course there are no 'groups' in the actual bytes; I split them this way based on the CMap, which contains 2 codespace ranges:
<20> <20>
<0000> <19FF>
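The grouping above can be reproduced mechanically. The following is a rough sketch, assuming a simplified matcher that accepts a code when every byte lies between the corresponding bytes of a range's low and high bounds; the function name is made up. It also shows what a reader that blindly assumes 2-byte codes would see, which is one plausible failure mode and not a confirmed diagnosis of any particular viewer.

```python
# The two codespace ranges from the file's CMap: <20> <20> and <0000> <19FF>.
CODESPACE_RANGES = [
    (bytes.fromhex("20"), bytes.fromhex("20")),
    (bytes.fromhex("0000"), bytes.fromhex("19FF")),
]


def split_codes(data: bytes, ranges) -> list[bytes]:
    """Split a shown string into character codes using codespace ranges."""
    codes = []
    pos = 0
    while pos < len(data):
        for low, high in ranges:
            chunk = data[pos:pos + len(low)]
            if len(chunk) == len(low) and all(
                lo <= b <= hi for lo, b, hi in zip(low, chunk, high)
            ):
                codes.append(chunk)
                pos += len(chunk)
                break
        else:
            raise ValueError(f"no codespace range matches at offset {pos}")
    return codes


# The 20 bytes of the string shown in the file, without the visual grouping.
shown = bytes.fromhex("000a000a000a" "20" "002000200020" "20" "002500250025")

codes = [c.hex() for c in split_codes(shown, CODESPACE_RANGES)]
naive = [shown[i:i + 2].hex() for i in range(0, len(shown), 2)]

print(codes)  # mixed 1-byte and 2-byte codes, 11 in total
print(naive)  # fixed 2-byte split: the codes after the first 0x20 are misaligned
```

In the fixed-width split, the 0x0020 codes never appear at all, because the single-byte 0x20 shifts every following pair by one byte.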
To make a long story short: if we look up the character codes in the CMap and get CIDs, then look up the CIDs in the CIDToGIDMap and get GIDs, then look up the GIDs in the embedded David-Bold font and get Unicode values, here's the table:
Code    CID  GID  Unicode  Name
0x000a  10   180  05EA     tav
0x0020  32   159  05D5     vav
0x0025  37   154  05D0     alef
0x20    228  03   0020     space
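The lookup chain behind the table can be sketched with three dictionaries. The values are copied from the table; the dictionary and function names are made up, and the real file stores these mappings in a CMap stream, a CIDToGIDMap stream and the font program rather than in anything like Python dicts.

```python
# Code -> CID (from the CMap); codes are kept as hex strings so that
# the 1-byte code "20" and the 2-byte code "0020" stay distinct.
CODE_TO_CID = {"000a": 10, "0020": 32, "0025": 37, "20": 228}
# CID -> GID (from the CIDToGIDMap).
CID_TO_GID = {10: 180, 32: 159, 37: 154, 228: 3}
# GID -> Unicode code point (as found in the embedded font program).
GID_TO_UNICODE = {180: 0x05EA, 159: 0x05D5, 154: 0x05D0, 3: 0x0020}


def code_to_char(code_hex: str) -> str:
    """Follow code -> CID -> GID -> Unicode for a single character code."""
    cid = CODE_TO_CID[code_hex]
    gid = CID_TO_GID[cid]
    return chr(GID_TO_UNICODE[gid])


# The codes of the shown string, in the order they appear in the content stream.
codes = ["000a"] * 3 + ["20"] + ["0020"] * 3 + ["20"] + ["0025"] * 3
text = "".join(code_to_char(c) for c in codes)
print([f"{ord(ch):04X}" for ch in text])
```

The result is in logical order (tav first); a viewer renders it right to left, which gives the string quoted in the question.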
Now we have enough information to speculate about what confuses the viewer applications.
In my first attempt, I suggested it was the code 32 (and CID 32) being used for a non-space character (see comment above). This assumption was based on a case, several years ago, when an older version of Acrobat didn't show a character with code 0x20 when it was at the end of a string, assuming it to be a space, when in fact, according to the encoding vector (of a simple font), it was another character.
I changed this:
- 0x0020 to 0x0004 in the content stream;
- bytes 08 and 09 in the CIDToGIDMap to GID=159;
- the value in the Widths array for CID=4 to the 'vav' width;
- the ToUnicode CMap was adjusted accordingly;
- (later I also tried removing the <0020> 32 string from the CMap; this is not reflected in the file linked in the comment).
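The CIDToGIDMap edit in the list above follows from the stream layout: the map holds two bytes per CID, high byte first, so the entry for CID 4 occupies bytes 8 and 9. A small sketch of that patch follows; the helper names and the zero-filled 256-entry table are assumptions for illustration, not the contents of the real stream.

```python
import struct


def set_gid(cid_to_gid: bytearray, cid: int, gid: int) -> None:
    """Write a GID into the 2-byte big-endian slot that belongs to a CID."""
    struct.pack_into(">H", cid_to_gid, 2 * cid, gid)


def get_gid(cid_to_gid: bytes, cid: int) -> int:
    """Read the GID stored for a CID."""
    return struct.unpack_from(">H", cid_to_gid, 2 * cid)[0]


# Hypothetical map covering CIDs 0..255, initially all zero.
table = bytearray(2 * 256)
set_gid(table, 32, 159)  # the original mapping: CID 32 -> GID 159 (vav)
set_gid(table, 4, 159)   # the change described above: CID 4 -> GID 159

print(table[8:10].hex())  # bytes 08 and 09 now hold GID 159
print(get_gid(table, 4))
```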
Well, it did help, but unfortunately some viewers still refused to comply with the specification.
Then I thought that maybe the variable character code width was the issue.
I returned to the original file and changed this:
- 0x20 to 0x00e4 in the content stream;
- <20> 228 to <00e4> 228 in the CMap;
- codespacerange <20> <20> in the CMap deleted;
- codespacerange <20> <20> in the ToUnicode CMap deleted.
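After these changes every code in the string is two bytes wide, so a plain fixed-width split is enough. A quick sketch under that assumption, with the code-to-CID values taken from the table earlier and the space remapped to 0x00e4 (the variable names are made up):

```python
# Code -> CID after the change: the space is now the 2-byte code 0x00e4.
CODE_TO_CID = {0x000A: 10, 0x0020: 32, 0x0025: 37, 0x00E4: 228}

# The shown string with both single-byte 0x20 codes replaced by 0x00e4.
fixed = bytes.fromhex("000a000a000a" "00e4" "002000200020" "00e4" "002500250025")

# With only the <0000> <19FF> codespace range left, every code is 2 bytes.
codes = [int.from_bytes(fixed[i:i + 2], "big") for i in range(0, len(fixed), 2)]
cids = [CODE_TO_CID[c] for c in codes]
print(cids)
```

The CIDs, and therefore the glyphs, are the same as before; only the byte width of the space code changed.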
This file appears to open perfectly in all the viewers mentioned in the original question and the comments below. Miraculously, the 0x0020 code and CID 32 do not interfere.
The conclusion, I think, can be this:
Given the current state of affairs, PDF creators are NOT advised to mix single-byte and double-byte codes in a font encoding (CMap).