Reading PDF, TJ operator strange encoding


Question

I'm currently trying to extract text from a PDF document, but I encountered some strange cases with the Tj operator. Normally I deal with cases like this:

   Tc (SOME_TEXT) TJ

Now I encounter a case like this:

   Tm  [
        ( )1.828
        (5)1.841
        (2)1.828
        (2)1.828
        (4)1.841
        (9)1.828
        (.)1.828
        (6)1.841
        (4)
       ]
   TJ 

Which converts to the string '52249.64'. Now I have encountered yet another strange case:

Td  (
        \t\004\007\020\007\016\016\026\020
    )
Tj 

The only info I could find is this: a string passed to Tj is always to be interpreted according to the Encoding or CMap for the font. (In this case I expect it is a CIDFont with a CMap.)

I still don't understand. Are these indexes that point to an offset in some kind of character array, or do I have to decode these values? Thanks!

Recommended answer

As @Paulo already indicated in his comment, you should first consult the PDF specification, i.e. currently ISO 32000-1, a free copy of which is provided by Adobe at http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf.

On the topic of text extraction, see in particular section 9.10, Extraction of Text Content, especially:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font's Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font's CMap.

b) Obtain the registry and ordering of the character collection used by the font's CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)
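To make the first method in the list above more concrete, here is a minimal Python sketch of reading a ToUnicode CMap and applying it to single-byte character codes. The CMap fragment, the function names `parse_tounicode` and `decode_codes`, and the mappings themselves are invented for illustration; a real CMap comes from the stream referenced by the font dictionary's ToUnicode entry. The sketch only handles `bfchar` entries and the simple three-operand form of `bfrange`, and it assumes one-byte codes and destinations that fit in one UTF-16 code unit for ranges.

```python
import re

# Hypothetical ToUnicode CMap fragment, made up for this example.
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<04> <0041>
<09> <0020>
endbfchar
1 beginbfrange
<10> <12> <0061>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar/bfrange blocks."""
    mapping = {}
    # bfchar: one source code mapped to one UTF-16BE encoded destination.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange (simple form only): consecutive codes map to consecutive values.
    # The array form "<lo> <hi> [<a> <b> ...]" is not handled in this sketch.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            base = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(base + offset)
    return mapping


def decode_codes(data, mapping):
    """Map each byte of a shown string to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in data)


table = parse_tounicode(SAMPLE_CMAP)
print(decode_codes(b"\x04\x09\x10\x11", table))
```

With a composite font the codes are usually two bytes wide, so a real extractor would first split the string according to the codespace ranges instead of iterating byte by byte.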

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents, in which case a conforming reader may choose a character code of their choosing.
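The priority order and the fallback can be sketched roughly as follows. This is an illustration under assumptions, not a complete implementation: the two lookup tables are tiny hand-picked excerpts (the real ones are Annex D of ISO 32000-1 and the Adobe Glyph List), the helper names are hypothetical, the composite-font method is left out, and U+FFFD is simply one possible choice for codes that cannot be resolved.

```python
# Tiny excerpts for illustration only; not the full tables.
WIN_ANSI_EXCERPT = {0x20: "space", 0x32: "two", 0x35: "five", 0x41: "A"}
AGL_EXCERPT = {"space": " ", "two": "2", "five": "5", "A": "A"}


def apply_differences(base_encoding, differences):
    """Overlay a Differences array onto a base encoding.

    A Differences array such as [1, "A", "two"] means: starting at code 1,
    assign the listed glyph names to consecutive codes.
    """
    table = dict(base_encoding)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item          # a number restarts the run at that code
        else:
            table[code] = item   # a name is assigned to the current code
            code += 1
    return table


def code_to_unicode(code, to_unicode=None, encoding=None):
    """Try the methods in priority order; fall back to U+FFFD."""
    # 1. A ToUnicode CMap, if the font has one, wins.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. Simple font: character code -> glyph name -> Unicode value.
    if encoding:
        name = encoding.get(code)
        if name in AGL_EXCERPT:
            return AGL_EXCERPT[name]
    # 3. (Composite fonts with predefined CMaps are omitted in this sketch.)
    return "\ufffd"


encoding = apply_differences(WIN_ANSI_EXCERPT, [1, "A", "two"])
print(code_to_unicode(0x35, encoding=encoding))
print(code_to_unicode(0x02, encoding=encoding))
```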

If some of the terms here are unknown to you, read about them in ISO 32000-1 or the other specifications referenced there.

So, to get acceptable text extraction results, make your text extractor support the methods described in that section.
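Applied to the operands shown in the question, the first step is only to turn the string operands into raw character codes; the values such as \004 and \020 are octal escapes inside a PDF literal string, not text. The following Python sketch shows that step under simplifying assumptions: the helper names are hypothetical, strings are assumed to contain no nested unescaped parentheses, and the numbers between the strings of a TJ array (glyph position adjustments) are simply skipped.

```python
import re

SIMPLE_ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
                  "(": 40, ")": 41, "\\": 92}


def pdf_literal_to_bytes(body):
    """Decode the text between the parentheses of a literal string into bytes."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))  # assumes characters in the range 0..255
            i += 1
            continue
        i += 1                   # skip the backslash
        if i >= len(body):
            break
        esc = body[i]
        if esc in "01234567":
            # Octal escape: one to three octal digits.
            j = i
            while j < len(body) and j < i + 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 1
        elif esc == "\n":
            i += 1               # backslash + newline is a line continuation
        else:
            out.append(ord(esc))  # unknown escape: the backslash is ignored
            i += 1
    return bytes(out)


def tj_array_strings(array_src):
    """Return the string operands of a TJ array, skipping the numbers."""
    return re.findall(r"\(((?:\\.|[^\\()])*)\)", array_src)


codes = pdf_literal_to_bytes(r"\t\004\007\020\007\016\016\026\020")
print(list(codes))  # [9, 4, 7, 16, 7, 14, 14, 22, 16]

parts = tj_array_strings("[( )1.828 (5)1.841 (2)1.828 (2)]")
print(b"".join(pdf_literal_to_bytes(p) for p in parts))
```

The resulting codes still have to be mapped to Unicode through the font's ToUnicode CMap or encoding as described in section 9.10.2; they are not offsets into a fixed character array.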
