Issue for Type0 CMap parsing


Question

I am currently working on iOS PDF scanning using PDFKitten. I am trying to extract text for searching in a PDF that uses a Type0 font, but I am not able to extract the text: some entries in ToUnicode are missing and some are misinterpreted. Could there be an issue with the parsing of the CMap? If I don't have a complete CMap, how should I derive it? Can I supply external entries for the missing ToUnicode entries?

Thanks

Recommended answer

    • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

    • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

    • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

    Furthermore, as section 9.10.1 indicates,

    • An ActualText entry for a structure element or marked-content sequence (see 14.9.4, "Replacement Text") may be used to specify the text content directly

    According to the specification, if these methods fail to produce a Unicode value, there is no way to determine what the character code represents. This is not entirely true: embedded font programs, for example, may contain their own mappings to Unicode, but such additional sources of information lie outside the PDF format itself.
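The lookup order described above can be read as a chain of fallbacks. The following is a minimal Python sketch of that idea only; the names `code_to_unicode`, `ucs2_cmap_name`, `to_unicode` and `glyph_list` are hypothetical, and the two tiny dictionaries merely stand in for a real ToUnicode CMap and a real encoding plus Adobe Glyph List lookup.

```python
from typing import Callable, Optional, Sequence

# A strategy takes a character code and returns Unicode text, or None.
Strategy = Callable[[int], Optional[str]]


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Step (c): build the name of the CID-to-Unicode CMap to fetch."""
    return f"{registry}-{ordering}-UCS2"


def code_to_unicode(code: int, strategies: Sequence[Strategy]) -> Optional[str]:
    """Try each strategy in order and return the first mapping found."""
    for lookup in strategies:
        text = lookup(code)
        if text is not None:
            return text
    return None  # none of the methods produced a Unicode value


# Hypothetical stand-ins for illustration only.
to_unicode = {0x0025: "C"}.get   # plays the role of a ToUnicode CMap
glyph_list = {0x0041: "A"}.get   # plays the role of encoding + glyph list

print(ucs2_cmap_name("Adobe", "Japan1"))
print(code_to_unicode(0x0025, [to_unicode, glyph_list]))
```

Keeping each source as a separate callable makes it straightforward to append a further fallback, such as a mapping read from the embedded font program, without touching the earlier steps.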

    EDIT

    The OP provided the file in question, iPhoneConfigurationProfileRef-2013-GM.pdf, via mail and indicated:

    I am getting problem for every glyph.

    The issue is that ranges present in PDF are not complete and are different from adobe-identity-cmap file.

    If I only use the CMap embedded in the PDF, I get no mapping for any character, and if I use the Adobe one, all the mappings are wrong.

    As he didn't get a mapping for any glyph, let us look at the title page as an example.

    The content stream contains these operations relevant for text extraction:

    BT 
    50 0 0 50 60 669.225 Tm 
    /G1 1 Tf 
    <0025> Tj 
    ET 
    BT 
    50 0 0 50 87.6 669.225 Tm 
    /G1 1 Tf 
    <005100500048004b004900570054> Tj 
    ET 
    BT 
    50 0 0 50 238 669.225 Tm 
    /G1 1 Tf 
    <0043> Tj 
    ET 
    BT 
    50 0 0 50 261.45 669.225 Tm 
    /G1 1 Tf 
    <0056004b00510050> Tj 
    ET 
    BT 
    50 0 0 50 355.4 669.225 Tm 
    /G1 1 Tf
    <0032> Tj 
    ET 
    BT 
    50 0 0 50 380.75 669.225 Tm 
    /G1 1 Tf 
    <0054> Tj 
    ET 
    BT 
    50 0 0 50 396.55 669.225 Tm 
    /G1 1 Tf 
    <00510048004b004e0047> Tj 
    ET 
    BT 
    50 0 0 50 60 609.225 Tm 
    /G1 1 Tf 
    <0034> Tj 
    ET 
    BT 
    50 0 0 50 86.65 609.225 Tm 
    /G1 1 Tf 
    <00470048> Tj 
    ET 
    BT
    50 0 0 50 125.05 609.225 Tm 
    /G1 1 Tf 
    <00470054> Tj 
    ET 
    BT 
    50 0 0 50 165.45 609.225 Tm 
    /G1 1 Tf 
    <004700500045> Tj 
    ET 
    BT 
    50 0 0 50 238.9 609.225 Tm 
    /G1 1 Tf 
    <0047> Tj 
    ET
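
The hex strings shown to `Tj` first have to be split into character codes before any CMap can be applied. A minimal Python sketch, assuming every code is two bytes wide (which fits the four-digit groups in this stream, but is an assumption in general); the function name `tj_codes` is made up for this illustration, and operators such as `TJ` or literal strings in parentheses are not handled.

```python
import re
from typing import List


def tj_codes(content: str, code_bytes: int = 2) -> List[List[int]]:
    """Return the character codes of every `<hex> Tj` operation."""
    result = []
    for hex_string in re.findall(r"<([0-9A-Fa-f\s]*)>\s*Tj", content):
        digits = "".join(hex_string.split())   # whitespace is allowed in hex strings
        if len(digits) % 2:
            digits += "0"                      # PDF pads an odd final digit with 0
        raw = bytes.fromhex(digits)
        result.append([int.from_bytes(raw[i:i + code_bytes], "big")
                       for i in range(0, len(raw), code_bytes)])
    return result


sample = """BT
50 0 0 50 60 669.225 Tm
/G1 1 Tf
<0025> Tj
ET
BT
50 0 0 50 87.6 669.225 Tm
/G1 1 Tf
<005100500048004b004900570054> Tj
ET"""

codes = tj_codes(sample)
print([[hex(c) for c in group] for group in codes])
```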
    

    So we need to look only at the font G1 on page 1. Fortunately the font has a ToUnicode map:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo <<
      /Registry (Adobe)
      /Ordering (UCS)
      /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <0000><FFFF>
    endcodespacerange
    1 beginbfchar
    <000f><002d 2010>
    endbfchar
    15 beginbfrange
    <0002><0002><0020>
    <0004><000c><0022>
    <000e><000e><002c>
    <0010><001d><002e>
    <001f><001f><003d>
    <0022><0032><0040>
    <0034><003d><0052>
    <003f><003f><005d>
    <0041><0041><005f>
    <0043><005c><0061>
    <005e><005e><007c>
    <008a><008a><00a9>
    <00a4><00a4><2014>
    <00a5><00a6><201c>
    <00a8><00a8><2019>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end 
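
To check whether a parser reads such a map correctly, it can help to compare against a very small reference reader. The Python sketch below is only that: `parse_to_unicode` is a hypothetical name, it handles the `<lo><hi><dst>` form of `bfrange` used in this file but not the array form `<lo><hi>[...]`, and it treats each range destination as a single integer.

```python
import re
from typing import Dict, List, Tuple

HEX = r"<([0-9A-Fa-f\s]+)>"


def _int(hex_digits: str) -> int:
    return int("".join(hex_digits.split()), 16)


def _text(hex_digits: str) -> str:
    # ToUnicode destinations are UTF-16BE code units.
    return bytes.fromhex("".join(hex_digits.split())).decode("utf-16-be")


def parse_to_unicode(cmap: str) -> Tuple[Dict[int, str], List[Tuple[int, int, int]]]:
    """Collect bfchar entries and (first, last, first target) bfrange triples."""
    chars: Dict[int, str] = {}
    ranges: List[Tuple[int, int, int]] = []
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            chars[_int(src)] = _text(dst)
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            ranges.append((_int(lo), _int(hi), _int(dst)))
    return chars, ranges


# Excerpt of the map above (three of the fifteen ranges).
excerpt = """1 beginbfchar
<000f><002d 2010>
endbfchar
3 beginbfrange
<0022><0032><0040>
<0034><003d><0052>
<0043><005c><0061>
endbfrange"""

chars, ranges = parse_to_unicode(excerpt)
print(chars, ranges)
```

Note that the `bfchar` destination `<002d 2010>` contains a space inside the hex string, so a reader that does not skip whitespace there will trip over this entry.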
    

    Applying this map (using the explicit beginbfrange...endbfrange entries) gives:

    <0025> Tj                          % "C"       = <0043>                         due to <0022><0032><0040>
    <005100500048004b004900570054> Tj  % "onfigur" = <006f006e00660069006700750072> due to <0043><005c><0061>
    <0043> Tj                          % "a"       = <0061>                         due to <0043><005c><0061>
    <0056004b00510050> Tj              % "tion"    = <00740069006f006e>             due to <0043><005c><0061>
    <0032> Tj                          % "P"       = <0050>                         due to <0022><0032><0040>
    <0054> Tj                          % "r"       = <0072>                         due to <0043><005c><0061>
    <00510048004b004e0047> Tj          % "ofile"   = <006f00660069006c0065>         due to <0043><005c><0061>
    <0034> Tj                          % "R"       = <0052>                         due to <0034><003d><0052>
    <00470048> Tj                      % "ef"      = <00650066>                     due to <0043><005c><0061>
    <00470054> Tj                      % "er"      = <00650072>                     due to <0043><005c><0061>
    <004700500045> Tj                  % "enc"     = <0065006e0063>                 due to <0043><005c><0061>
    <0047> Tj                          % "e"       = <0065>                         due to <0043><005c><0061>
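
The arithmetic behind each row is the same: a code inside a range maps to the range's first Unicode value plus the code's offset from the start of the range. A short Python sketch, with only the three ranges needed for the title hard-coded; `map_code` and `decode` are illustrative names, and two-byte codes are assumed.

```python
from typing import List, Optional, Tuple

# (first code, last code, first Unicode value), copied from the bfrange section.
RANGES: List[Tuple[int, int, int]] = [
    (0x0022, 0x0032, 0x0040),
    (0x0034, 0x003D, 0x0052),
    (0x0043, 0x005C, 0x0061),
]


def map_code(code: int) -> Optional[str]:
    """Map one character code through the ranges, or return None."""
    for first, last, target in RANGES:
        if first <= code <= last:
            return chr(target + (code - first))
    return None


def decode(hex_string: str) -> str:
    """Decode a Tj hex string of two-byte codes; unmapped codes become U+FFFD."""
    raw = bytes.fromhex(hex_string)
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(map_code(c) or "\ufffd" for c in codes)


title = ["0025", "005100500048004b004900570054", "0043", "0056004b00510050",
         "0032", "0054", "00510048004b004e0047"]
words = [decode(s) for s in title]
print(words)
```

For example, code `<0051>` lies in `<0043><005c><0061>`, so it maps to 0x61 + (0x51 - 0x43) = 0x6F, the letter "o".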
    

    This matches the appearance of the page very well.
