If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font's Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
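The two steps for simple fonts can be sketched roughly as follows. This is only an illustration: `WIN_ANSI_EXCERPT` and `AGL_EXCERPT` are tiny hand-picked excerpts standing in for Table D.1 and the Adobe Glyph List, the function names are made up for this sketch, and the Differences array in the usage lines is hypothetical.

```python
# Tiny excerpts only; the real tables are far larger.
WIN_ANSI_EXCERPT = {0x41: "A", 0x80: "Euro", 0x93: "quotedblleft"}
AGL_EXCERPT = {"A": 0x0041, "Euro": 0x20AC, "quotedblleft": 0x201C, "fi": 0xFB01}


def apply_differences(base, differences):
    """Overlay a Differences array (code, name, name, code, name, ...) on a base encoding."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item              # a number resets the running character code
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            encoding[code] = item    # a name is assigned to the running code
            code += 1
    return encoding


def simple_font_to_unicode(code, encoding, glyph_list):
    """Step a) code -> character name, step b) character name -> Unicode."""
    name = encoding.get(code)
    if name is None:
        return None
    value = glyph_list.get(name)
    return chr(value) if value is not None else None


# Hypothetical Differences array: code 0x41 becomes "fi", code 0x42 becomes "Euro".
encoding = apply_differences(WIN_ANSI_EXCERPT, [0x41, "fi", "Euro"])
print(repr(simple_font_to_unicode(0x41, encoding, AGL_EXCERPT)))
```

Returning `None` for an unknown code or name mirrors the situation in which this method fails and another source of Unicode information is needed.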
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font's CMap.
b) Obtain the registry and ordering of the character collection used by the font's CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
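Steps a) to e) might look like this in code. It is a sketch under assumptions: both CMaps are modelled as plain dictionaries, `load_cmap` is a hypothetical callback that returns the CID-to-Unicode table for a given CMap name, and the numbers in the usage lines are made-up placeholders, not real Adobe-Japan1 data.

```python
def ucs2_cmap_name(registry, ordering):
    """Step c) build the registry-ordering-UCS2 name, e.g. Adobe-Japan1-UCS2."""
    return f"{registry}-{ordering}-UCS2"


def composite_code_to_unicode(code, code_to_cid, cid_system_info, load_cmap):
    cid = code_to_cid.get(code)                      # step a) code -> CID
    if cid is None:
        return None
    name = ucs2_cmap_name(cid_system_info["Registry"],
                          cid_system_info["Ordering"])  # steps b) and c)
    cid_to_unicode = load_cmap(name)                 # step d) fetch the UCS2 CMap
    value = cid_to_unicode.get(cid)                  # step e) CID -> Unicode
    return chr(value) if value is not None else None


# Placeholder data for illustration only.
fake_cmaps = {"Adobe-Japan1-UCS2": {1000: 0x3042}}
info = {"Registry": "Adobe", "Ordering": "Japan1", "Supplement": 0}
print(composite_code_to_unicode(0x82A0, {0x82A0: 1000}, info, fake_cmaps.__getitem__))
```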
Furthermore, as section 9.10.1 indicates,
- An ActualText entry for a structure element or marked-content sequence (see 14.9.4, "Replacement Text") may be used to specify the text content directly.
According to the specification, if these methods fail to produce a Unicode value, there is no way to determine what the character code represents. This is not entirely true: embedded font programs, for example, may contain their own mappings to Unicode, but such additional sources of information are beyond the actual PDF format.
EDIT
The OP provided the file in question, iPhoneConfigurationProfileRef-2013-GM.pdf, via mail and indicated:
I am getting problem for every glyph.
The issue is that ranges present in PDF are not complete and are different from adobe-identity-cmap file.
If I only use CMap embedded in PDF, I get no mapping for every character and if I use adobe one the all mappings are wrong.
As he didn't get a mapping for any glyph, let us look at the title page as an example.
The content stream contains these operations relevant for text extraction:
BT
50 0 0 50 60 669.225 Tm
/G1 1 Tf
<0025> Tj
ET
BT
50 0 0 50 87.6 669.225 Tm
/G1 1 Tf
<005100500048004b004900570054> Tj
ET
BT
50 0 0 50 238 669.225 Tm
/G1 1 Tf
<0043> Tj
ET
BT
50 0 0 50 261.45 669.225 Tm
/G1 1 Tf
<0056004b00510050> Tj
ET
BT
50 0 0 50 355.4 669.225 Tm
/G1 1 Tf
<0032> Tj
ET
BT
50 0 0 50 380.75 669.225 Tm
/G1 1 Tf
<0054> Tj
ET
BT
50 0 0 50 396.55 669.225 Tm
/G1 1 Tf
<00510048004b004e0047> Tj
ET
BT
50 0 0 50 60 609.225 Tm
/G1 1 Tf
<0034> Tj
ET
BT
50 0 0 50 86.65 609.225 Tm
/G1 1 Tf
<00470048> Tj
ET
BT
50 0 0 50 125.05 609.225 Tm
/G1 1 Tf
<00470054> Tj
ET
BT
50 0 0 50 165.45 609.225 Tm
/G1 1 Tf
<004700500045> Tj
ET
BT
50 0 0 50 238.9 609.225 Tm
/G1 1 Tf
<0047> Tj
ET
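To check which fonts and character codes such a stream actually uses, the `Tf` and `Tj` operations can be pulled out with a small script. This is a deliberately naive sketch: the function name and regular expression are made up for illustration, it only understands hex strings shown with `Tj` (no literal strings, no `TJ` arrays), and it assumes two-byte character codes.

```python
import re

# Either a font selection (/Name size Tf) or a hex string shown with Tj.
TOKEN = re.compile(r"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f\s]*)>\s*Tj")


def shown_codes(content, code_bytes=2):
    """Return (font name, [character codes]) for every hex-string Tj operation."""
    font = None
    shown = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            font = match.group(1)                  # remember the current font
            continue
        digits = "".join(match.group(2).split())   # whitespace inside <...> is ignored
        if len(digits) % 2:
            digits += "0"                          # odd-length hex strings are padded with 0
        raw = bytes.fromhex(digits)
        codes = [int.from_bytes(raw[i:i + code_bytes], "big")
                 for i in range(0, len(raw), code_bytes)]
        shown.append((font, codes))
    return shown


sample = """BT
50 0 0 50 60 669.225 Tm
/G1 1 Tf
<0025> Tj
ET
BT
50 0 0 50 261.45 669.225 Tm
/G1 1 Tf
<0056004b00510050> Tj
ET
"""
for font, codes in shown_codes(sample):
    print(font, [f"{code:04x}" for code in codes])
```

On the two text objects in `sample`, both entries report the font name `G1`.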
So we need to look only at the font G1 on page 1. Fortunately the font has a ToUnicode map:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
1 beginbfchar
<000f><002d 2010>
endbfchar
15 beginbfrange
<0002><0002><0020>
<0004><000c><0022>
<000e><000e><002c>
<0010><001d><002e>
<001f><001f><003d>
<0022><0032><0040>
<0034><003d><0052>
<003f><003f><005d>
<0041><0041><005f>
<0043><005c><0061>
<005e><005e><007c>
<008a><008a><00a9>
<00a4><00a4><2014>
<00a5><00a6><201c>
<00a8><00a8><2019>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Trying to apply this map, one gets (based on the explicit beginbfrange...endbfrange entries):
<0025> Tj % "C" = <0043> due to <0022><0032><0040>
<005100500048004b004900570054> Tj % "onfigur" = <006f006e00660069006700750072> due to <0043><005c><0061>
<0043> Tj % "a" = <0061> due to <0043><005c><0061>
<0056004b00510050> Tj % "tion" = <00740069006f006e> due to <0043><005c><0061>
<0032> Tj % "P" = <0050> due to <0022><0032><0040>
<0054> Tj % "r" = <0072> due to <0043><005c><0061>
<00510048004b004e0047> Tj % "ofile" = <006f00660069006c0065> due to <0043><005c><0061>
<0034> Tj % "R" = <0052> due to <0034><003d><0052>
<00470048> Tj % "ef" = <00650066> due to <0043><005c><0061>
<00470054> Tj % "er" = <00650072> due to <0043><005c><0061>
<004700500045> Tj % "enc" = <0065006e0063> due to <0043><005c><0061>
<0047> Tj % "e" = <0065> due to <0043><005c><0061>
This matches the appearance of the page very well.
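The manual lookup can be reproduced with a few lines of code. The following is a minimal sketch, not a full CMap parser: it reads only bfrange entries of the form `<lo><hi><dst>` (no bfchar entries, no array destinations), assumes two-byte codes and a single UTF-16 code unit as destination, and uses only the three ranges of the ToUnicode map that the title page needs. All names are chosen for this illustration.

```python
import re

# The three bfrange entries of the ToUnicode map that the title page uses.
TO_UNICODE_EXCERPT = """
3 beginbfrange
<0022><0032><0040>
<0034><003d><0052>
<0043><005c><0061>
endbfrange
"""

BFRANGE_BLOCK = re.compile(r"beginbfrange(.*?)endbfrange", re.S)
TRIPLE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfranges(cmap_text):
    """Collect (first code, last code, first Unicode value) triples."""
    ranges = []
    for block in BFRANGE_BLOCK.finditer(cmap_text):
        for lo, hi, dst in TRIPLE.findall(block.group(1)):
            ranges.append((int(lo, 16), int(hi, 16), int(dst, 16)))
    return ranges


def decode(hex_string, ranges):
    """Map each two-byte code to dst + (code - lo) of the range containing it."""
    raw = bytes.fromhex(hex_string)
    out = []
    for i in range(0, len(raw), 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        for lo, hi, dst in ranges:
            if lo <= code <= hi:
                out.append(chr(dst + code - lo))
                break
        else:
            out.append("\ufffd")   # no range covers this code
    return "".join(out)


ranges = parse_bfranges(TO_UNICODE_EXCERPT)
title = ["0025", "005100500048004b004900570054", "0043", "0056004b00510050"]
print("".join(decode(s, ranges) for s in title))   # prints "Configuration"
```

The key point is the offset arithmetic: code `<0025>` lies in the range starting at `<0022>`, whose destination starts at `<0040>`, so it maps to 0x40 + (0x25 - 0x22) = 0x43, which is "C".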