PHP Filter FlateDecode PDF stream returning offset characters

Views: 107  Published: 2020/5/25 5:26:31  Tags: php pdf character-encoding text-extraction


Question

I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. The odd thing is that the characters appear to be there and correct once I add 29 to the ord of each character.

Example response debug printout:

/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT

The code uses gzuncompress on the stream section of the PDF. The string $PRXQW is "Amount": adding 29 (decimal) to the ord of each character produces it. But some characters do not follow this exact translation; for example, what should be a ) in the text appears as the two bytes 5C 66.
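
To make the arithmetic above concrete, here is a minimal sketch of the observed offset. The function names pdfUnescapeLiteral and shiftCodes are hypothetical, and the offset of 29 is only what happens to match this particular font, not a general rule. One observation worth noting: 5C 66 is the PDF literal-string escape \f (form feed, byte 0C), and 0C + 1D (29) = 29 hex, which is ")". So that character may follow the same offset after all once escapes are resolved.

```php
<?php
// Resolve backslash escapes in the body of a PDF literal string
// (the bytes between the outer parentheses).
function pdfUnescapeLiteral(string $s): string
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08",
            'f' => "\x0C", '(' => '(', ')' => ')', '\\' => '\\'];
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ($s[$i] !== '\\' || $i + 1 >= $len) {
            $out .= $s[$i]; // ordinary byte (a trailing lone backslash is kept as-is)
            continue;
        }
        $next = $s[++$i];
        if (isset($map[$next])) {
            $out .= $map[$next];
        } elseif (strpos('01234567', $next) !== false) {
            // Octal escape: one to three octal digits.
            $oct = $next;
            while (strlen($oct) < 3 && $i + 1 < $len
                   && strpos('01234567', $s[$i + 1]) !== false) {
                $oct .= $s[++$i];
            }
            $out .= chr(octdec($oct) & 0xFF);
        } elseif ($next === "\r" || $next === "\n") {
            // Backslash followed by end-of-line is a line continuation: emit nothing.
            if ($next === "\r" && $i + 1 < $len && $s[$i + 1] === "\n") {
                $i++;
            }
        } else {
            $out .= $next; // unknown escape: the backslash is dropped
        }
    }
    return $out;
}

// Add a fixed offset to every byte. The offset is an assumption taken from
// the observation in the question; it is not how PDF encodings are defined.
function shiftCodes(string $bytes, int $offset): string
{
    if ($bytes === '') {
        return '';
    }
    $out = '';
    foreach (str_split($bytes) as $b) {
        $out .= chr((ord($b) + $offset) & 0xFF);
    }
    return $out;
}

echo shiftCodes(pdfUnescapeLiteral('$PRXQW'), 29), "\n"; // Amount
echo shiftCodes(pdfUnescapeLiteral('\f'), 29), "\n";     // )
```

Resolving escapes before shifting is the important ordering here: the offset applies to character codes, not to the raw bytes of the escaped string.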

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

Recommended answer

The encoding of the string argument of the Tj operation depends entirely on the PDF font used (F1 in the case at hand):

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
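
As an illustration of the difference between the two cases quoted above, the following sketch splits the bytes of a string operand into integer character codes. The function name splitCharacterCodes and the fixed $bytesPerCode parameter are assumptions made for this example: 1 corresponds to a simple font, and 2 to a composite font with a fixed two-byte CMap such as Identity-H. A real CMap can define codespace ranges of mixed lengths, which this sketch does not handle.

```php
<?php
// Split the raw bytes of a text-showing string into integer character codes.
// $bytesPerCode is supplied by the caller (an assumption of this sketch):
// 1 for a simple font, 2 for a fixed two-byte CMap such as Identity-H.
function splitCharacterCodes(string $bytes, int $bytesPerCode): array
{
    if ($bytesPerCode < 1) {
        throw new InvalidArgumentException('bytesPerCode must be at least 1');
    }
    $codes = [];
    $len = strlen($bytes);
    // Trailing bytes that do not form a complete code are ignored.
    for ($i = 0; $i + $bytesPerCode <= $len; $i += $bytesPerCode) {
        $code = 0;
        for ($j = 0; $j < $bytesPerCode; $j++) {
            // Codes are big-endian: the first byte is the most significant.
            $code = ($code << 8) | ord($bytes[$i + $j]);
        }
        $codes[] = $code;
    }
    return $codes;
}

// Simple font: every byte is its own code.
print_r(splitCharacterCodes('$P', 1));                 // codes 36 and 80
// Two-byte codes: the same two codes, written as 00 24 00 50.
print_r(splitCharacterCodes(hex2bin('00240050'), 2));  // codes 36 and 80
```

Either way, the resulting numbers are only character codes. Turning them into text still requires the font's encoding or CMap.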

The OP's code seems to assume a standard encoding like MacRomanEncoding or WinAnsiEncoding, but these are merely special cases. As the quote above indicates, the encoding might just as well be some ad hoc mixed multibyte encoding.

A later section of the PDF specification describes how to extract text properly:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font’s CMap.

b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
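
The first method in the quoted list, the ToUnicode CMap, can be sketched roughly as follows. This is a simplified illustration and not a complete CMap parser. The function names are hypothetical, the sample CMap is invented to reproduce the offset seen in the question (it is not taken from the OP's file), the array form of bfrange is skipped, and surrogate pairs in destination values are not combined.

```php
<?php
// Encode one Unicode code point as UTF-8 without relying on mbstring.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Parse bfchar and simple bfrange sections into [code => UTF-8 string].
function parseToUnicode(string $cmap): array
{
    $hex = '<([0-9A-Fa-f]+)>';
    $map = [];
    if (preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections)) {
        foreach ($sections[1] as $body) {
            preg_match_all("/$hex\\s*$hex/", $body, $pairs, PREG_SET_ORDER);
            foreach ($pairs as $p) {
                // The destination is UTF-16BE; each 4-digit unit is treated
                // as one BMP code point (surrogates are not combined).
                $text = '';
                foreach (str_split($p[2], 4) as $unit) {
                    $text .= codePointToUtf8((int) hexdec($unit));
                }
                $map[(int) hexdec($p[1])] = $text;
            }
        }
    }
    if (preg_match_all('/beginbfrange(.*?)endbfrange/s', $cmap, $sections)) {
        foreach ($sections[1] as $body) {
            preg_match_all("/$hex\\s*$hex\\s*$hex/", $body, $rows, PREG_SET_ORDER);
            foreach ($rows as $r) {
                $lo = (int) hexdec($r[1]);
                $hi = (int) hexdec($r[2]);
                $dst = (int) hexdec($r[3]);
                for ($code = $lo; $code <= $hi; $code++) {
                    $map[$code] = codePointToUtf8($dst + ($code - $lo));
                }
            }
        }
    }
    return $map;
}

// Map character codes to text; unmapped codes become U+FFFD.
function decodeWithMap(array $codes, array $map): string
{
    $out = '';
    foreach ($codes as $code) {
        $out .= $map[$code] ?? "\u{FFFD}";
    }
    return $out;
}

// Hypothetical ToUnicode CMap fragment: codes 03..61 map to U+0020 onward,
// which is the same as "add 29"; code 62 maps to the two letters "fi".
$cmap = <<<'CMAP'
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfrange
<03> <61> <0020>
endbfrange
1 beginbfchar
<62> <00660069>
endbfchar
CMAP;

$codes = array_map('ord', str_split('$PRXQW'));
echo decodeWithMap($codes, parseToUnicode($cmap)), "\n"; // Amount
```

Under these assumptions, the constant offset from the question is just one possible bfrange entry, which suggests why a hard-coded offset can work for one font and fail for the next.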

Thus:

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

Yes, it is fairly common for PDFs in the wild to have text-drawing operator string arguments in an encoding entirely different from anything ASCII-like. And as the last paragraph of the second quote above hints, there are situations that do not allow text extraction at all (without OCR, that is), even though there are additional places one can look for a mapping to Unicode.
