Decoding a FlateDecoded section of text in a PDF document

Question

Using peepdf, I am analyzing two simple PDF files. Both files contain a single line of text ("ZYXWVUTSRQQRSTUVWXYZ") and were created on Mac OS X.

The first file was created with TextEdit. There are only three streams, and looking at the first one (automatically decoded with peepdf) shows the text clearly.

PPDF> stream 4

q Q q 72 707.272 468 12.72803 re W n /Cs1 cs 0 sc q 0.9790795 0 0 -0.9790795 72 720
cm BT 0.0001 Tc 11 0 0 -11 5 10 Tm /TT1 1 Tf (ZYXWVUTSRQQRSTUVWXYZ) Tj ET
Q Q

The second file was created with MS Word. There are four streams, but the decoded text is nowhere to be found. Looking at the corresponding stream in the Word-generated file does not reveal the decoded string:

PPDF> stream 4

q Q q 18 40 576 734 re W n /Cs1 cs 0 0 0 sc q 0.24 0 0 0.24 90 708.72 cm BT
-0.0004 Tc 50 0 0 50 0 0 Tm /TT2 1 Tf [ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\))
-1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ ET Q q 0.24 0 0 0.24 239.168 708.72
cm BT 50 0 0 50 0 0 Tm /TT2 1 Tf (+) Tj ET Q Q

It's not apparent to me where the string is in the file or what the information in this stream means. Any insights?

Recommended answer

It's not apparent to me where the string is in the file

In general, you won't see the clear text in the content stream because the encoding used there need not be a standard encoding, let alone anything ASCII-like.

[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ

The array operand of this operation contains your ZYXWVUTSRQQRSTUVWXYZ, with some kerning corrections for certain pairs of characters.

It looks like an ad hoc encoding using the bytes from 33 (= 0x21 = '!') onwards: '!' is used for the first glyph needed, Z; '"' for the second one, Y; '#' for the third one, X; and so on. Your test string not only starts with these characters but also ends with them in reverse order, and so does the array above: (!") -1 (#) ... (#) -1 (") -1 (!).
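To check this reading of the stream, here is a minimal Python sketch. It assumes the TJ array quoted above and the known test string; the helper names `tj_codes` and `infer_mapping` are made up for illustration, and the string parser is deliberately simplified (no octal escapes, no nested unescaped parentheses).

```python
import re

# The TJ operand as quoted from the Word-generated content stream.
TJ_ARRAY = r"""[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ]"""
KNOWN_TEXT = "ZYXWVUTSRQQRSTUVWXYZ"


def tj_codes(array_text):
    """Concatenate the literal strings of a TJ array, ignoring the numbers.

    Simplification: only backslash + single character escapes are handled.
    """
    literals = re.findall(r"\(((?:\\.|[^\\()])*)\)", array_text)
    return "".join(re.sub(r"\\(.)", r"\1", lit) for lit in literals)


def infer_mapping(codes, text):
    """Pair each code with the character at the same position in the known text."""
    if len(codes) != len(text):
        raise ValueError("code string and known text differ in length")
    mapping = {}
    for code, glyph in zip(codes, text):
        # A code must always stand for the same glyph, otherwise the guess is wrong.
        if mapping.setdefault(code, glyph) != glyph:
            raise ValueError(f"code {code!r} maps to two different glyphs")
    return mapping


codes = tj_codes(TJ_ARRAY)
mapping = infer_mapping(codes, KNOWN_TEXT)
decoded = "".join(mapping[c] for c in codes)

print(codes)
print(sorted(mapping.items()))
print(decoded)
```

The mapping is consistent (no code is assigned two glyphs) and uses ten consecutive codes starting at 0x21, which supports the ad hoc encoding hypothesis. This only works because the plain text is already known; for unknown text you need the font's own mapping information.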

Inspect the definition of the font used (TT2). It may (or may not) include information that helps you decode this encoding.

or what the information in this stream means. Any insights?

To understand the contents of PDF content streams, you should read the relevant sections of the PDF specification ISO 32000-1, especially chapters 8 Graphics and 9 Text.

As your question focuses on recognizing text content, read for example section 9.10.2, Mapping Character Codes to Unicode Values:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
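The simple-font branch of the quoted procedure (code to character name via the Differences array, then name to Unicode via the Adobe Glyph List) can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the Differences array is hypothetical (the file in the question may use a ToUnicode CMap instead), the base encoding from Table D.1 is left out, and `AGL_EXCERPT` holds only three entries of the real glyph list.

```python
def expand_differences(differences):
    """Expand a Differences array into a {character code: glyph name} dict.

    An integer sets the current code; each following name is assigned to
    consecutive codes starting there.
    """
    table = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("a glyph name appeared before any start code")
            table[code] = item
            code += 1
    return table


# Hypothetical Differences array: codes 33, 34, 35 name the glyphs Z, Y, X.
DIFFERENCES = [33, "Z", "Y", "X"]

# Tiny excerpt of the Adobe Glyph List (glyph name -> Unicode code point).
AGL_EXCERPT = {"X": 0x0058, "Y": 0x0059, "Z": 0x005A}


def decode_simple(data, differences, glyph_list):
    """Map each byte to Unicode; unknown codes become U+FFFD."""
    names = expand_differences(differences)
    out = []
    for byte in data:
        name = names.get(byte)
        code_point = glyph_list.get(name)
        out.append(chr(code_point) if code_point is not None else "\ufffd")
    return "".join(out)


print(decode_simple(b'!"#', DIFFERENCES, AGL_EXCERPT))
```

A real implementation would start from the font's base encoding and only then overlay the Differences entries.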

Regarding the comments

One of the objects gave some font info. It is 'JJOWGO+Cambria' and references object 16 as the 'font file' which was also unreadable. I'll review the manual. Can't find anything online about 'JJOWGO'.

You won't find anything specific about JJOWGO because it is most likely a random tag prefixed to Cambria to indicate that only a subset of the font is embedded, not the whole font. Cf. section 9.6.4, Font Subsets, of ISO 32000-1:

PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow a conforming reader to recognize font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)

For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.

EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font.
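The tag rule quoted above (exactly six uppercase letters followed by a plus sign) is easy to test mechanically. A minimal Python sketch, with `split_subset_tag` being a made-up helper name:

```python
import re

# Six uppercase letters, a plus sign, then the actual PostScript font name.
SUBSET_NAME = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None if the name has no subset tag."""
    name = font_name.lstrip("/")  # accept PDF name objects such as /JJOWGO+Cambria
    match = SUBSET_NAME.match(name)
    if match:
        return match.group(1), match.group(2)
    return None, name


print(split_subset_tag("/JJOWGO+Cambria"))
print(split_subset_tag("EOODIA+Poetica"))
print(split_subset_tag("Helvetica"))
```

Applied to the font name from the comment, this separates the arbitrary tag JJOWGO from the real font name Cambria, which is the part worth searching for.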

 <<
 /FontBBox [ -1475 -2463 2867 3117 ]
 /StemV 0
 /FontFile2 16 0 R
 /Descent -222
 /XHeight 467
 /Flags 4
 /Ascent 950
 /FontName /JJOWGO+Cambria
 /Type /FontDescriptor
 /ItalicAngle 0
 /AvgWidth 615
 /MaxWidth 2919
 /CapHeight 667
 >>

This font descriptor contains no obvious encoding information. Look at the actual font dictionary and check for a ToUnicode entry; cf. the quotation of section 9.10.2 above.
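If the font dictionary does have a ToUnicode entry, its (decoded) stream is a CMap whose bfchar and bfrange sections list the code-to-Unicode pairs. The following Python sketch shows how such a CMap might be read. The sample CMap is invented for illustration and is not taken from the file in the question, and the parser is simplified: it ignores the array form of bfrange and assumes range destinations are single BMP code points.

```python
import re

# Hypothetical ToUnicode CMap content (already Flate-decoded).
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<21> <005A>
<22> <0059>
<23> <0058>
endbfchar
1 beginbfrange
<30> <32> <0041>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} dict from bfchar/bfrange sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            # Destinations are UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode(data, mapping):
    """Decode single-byte codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(byte, "\ufffd") for byte in data)


to_unicode = parse_tounicode(SAMPLE_CMAP)
print(decode(b'!"#', to_unicode))
print(decode(b"012", to_unicode))
```

With the real CMap from the font dictionary in place of `SAMPLE_CMAP`, the same `decode` call applied to the concatenated TJ strings should recover the text, provided the font uses single-byte codes.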
