Trying to understand data in cross-reference (XRef) stream in PDF


Problem description


I'm trying to read a PDF file that is linearized and uses cross-reference streams. I believe that I mostly understand what's happening except for the last two entries in the table. Those two, for objects 5 and 6, appear to be in use but show file offsets that vastly exceed the file size. Also, the PDF file I have doesn't even contain objects numbered 5 or 6.

Here is the cross-reference stream:

4 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<ED772C59D33BA74FA1DEE567740067A0><ED772C59D33BA74FA1DEE567740067A0>]/Info 6 0 R/Length 39/Root 8 0 R/Size 7/Type/XRef/W[1 3 0]>>stream

hfibb&F…ˆl&fit ¡ÿ"∏ôügÕ≤=‘

endstream




And here are the raw data after FlateDecode, arranged in rows. FlateDecode reports that 35 bytes of data were inflated.

02 00 00 00 00
02 01 19 87 6b
02 00 00 0d 67
02 00 00 01 8c
02 00 00 01 0b
02 01 e7 6a 99
02 00 00 00 01


I also applied a PNG Predictor function (up) which yielded 7 rows of 4 bytes each:

00 00 00 00
01 19 87 6b
01 19 94 d2
00 00 0e f3
00 00 02 97
01 e7 6b a4
01 e7 6a 9a


Row 0 is all zero, check. The offsets for object 1 and 2 do in fact address object 1 and 2 in the PDF file. So far, so good. Object 3 is marked unused, and for sure there is no object 3 in the PDF file.

But then, I'm a little confused that object 4, this cross-reference stream, is marked as unused. Still, since it is object 4 that I am parsing, I've clearly had no difficulty finding it.

But where I am completely confused is the rows for objects 5 and 6. The "01" in the first column tells me that they are in use. But their offsets exceed the size of the entire file, and in any case, there is no object 5 or 6 in the file. The Size entry in the dictionary clearly has a value of 7, telling me the table should contain data for objects 0 through 6. After filtering, I have 28 bytes of data, which makes sense for seven rows of four bytes each.

Why are entries for 5 and 6 there at all? And, given that they are there, why are they marked as "in use" with apparently nonsense offsets?

The file seems valid. Both Adobe Illustrator and Acrobat Reader open it without complaint. I haven't found anything in the PDF spec about special treatment for the last two rows of an Xref stream. What am I missing?

Recommended answer


You interpret the predictor to add the current input row and the previous input row to retrieve the current data row. Shouldn't you add the current input row and the previous data row? That would change results for object 3 onward:

02 00 00 00 00    00 00 00 00
02 01 19 87 6b    01 19 87 6b
02 00 00 0d 67    01 19 94 d2
02 00 00 01 8c    01 19 95 5e
02 00 00 01 0b    01 19 96 69
02 01 e7 6a 99    02 00 00 02
02 00 00 00 01    02 00 00 03
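
The right-hand column above can be reproduced with a short sketch. This assumes every row carries PNG filter tag 2 (Up), as in the data from the question; the function name `undo_png_up` is made up for illustration and is not part of any PDF library.

```python
def undo_png_up(rows):
    """Undo the PNG "Up" filter.

    Each decoded byte is the filtered byte plus the byte in the same
    column of the previous *decoded* row, modulo 256.
    """
    # The row "above" the first row is treated as all zeros.
    prev = bytes(len(rows[0]) - 1)
    decoded = []
    for row in rows:
        tag, data = row[0], row[1:]
        if tag != 2:
            raise ValueError(f"this sketch only handles filter type 2, got {tag}")
        # Add against the previously decoded row, not the previous input row.
        prev = bytes((b + p) & 0xFF for b, p in zip(data, prev))
        decoded.append(prev)
    return decoded


# The seven 5-byte rows that FlateDecode produced in the question.
filtered = [bytes.fromhex(h) for h in (
    "0200000000", "020119876b", "0200000d67", "020000018c",
    "020000010b", "0201e76a99", "0200000001",
)]

for row in undo_png_up(filtered):
    print(row.hex(" "))
```

The only difference from the calculation in the question is that `prev` is overwritten with the decoded row on every iteration, so the sums accumulate down the table.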


Now objects 3 and 4 have proper offsets matching the data from your pastebin paste and objects 5 and 6 would be marked as objects in object streams.
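
To make the reading of each row explicit, here is a minimal sketch that splits the decoded rows according to `/W [1 3 0]`. It assumes the usual cross-reference stream conventions: a zero-width field takes its default value (type 1 for the first field, 0 for the others), type 0 is a free entry, type 1 is an object at a byte offset, and type 2 is an object stored inside an object stream. The dictionary in the question has no `/Index`, so numbering is assumed to start at object 0. The helper name `parse_xref_rows` is hypothetical.

```python
def parse_xref_rows(rows, widths, first_obj=0):
    """Split decoded xref rows into (object number, type, field2, field3)."""
    entries = []
    for num, row in enumerate(rows, start=first_obj):
        fields, pos = [], 0
        for i, w in enumerate(widths):
            if w == 0:
                # Zero-width field: default is 1 for the type, 0 otherwise.
                fields.append(1 if i == 0 else 0)
            else:
                fields.append(int.from_bytes(row[pos:pos + w], "big"))
            pos += w
        entries.append((num, *fields))
    return entries


# Decoded rows from the right-hand column of the answer's table.
decoded = [bytes.fromhex(h) for h in (
    "00000000", "0119876b", "011994d2", "0119955e",
    "01199669", "02000002", "02000003",
)]

KIND = {0: "free", 1: "in use at offset", 2: "compressed in object stream"}

entries = parse_xref_rows(decoded, [1, 3, 0])
for num, kind, field2, field3 in entries:
    print(f"object {num}: {KIND[kind]} {field2:#x} (third field {field3})")
```

Read this way, the second field of objects 5 and 6 is an object stream number (2 and 3 respectively) rather than a file offset, which is why searching the file for `5 0 obj` or `6 0 obj` finds nothing.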
