Export PDF page labels on command line


Problem description

I'd like to export the page-labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but this seems like overkill.

Is there no command-line tool that will simply print the page label for each page (or together with other metadata)? I know that PDFSpy will export the label, but $300 isn't an option; preferably the solution should be free.

Solution

Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.

Also, you will not be able to avoid expanding compressed objects and object streams, using a tool like qpdf or one with equivalent capabilities.

Long answer:
There's no such tool because there are only a few things you can safely rely on when it comes to page labels. They are the following:

  1. Each PDF document must contain a root object.
  2. That root object must be of /Type /Catalog.
  3. The document's trailer will show where to find the object using the key /Root followed by the indirect object number reference.
  4. IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
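
The four points above can be sketched in Python as a rough check. This is only an illustration under strong assumptions: the PDF uses a classic `trailer` dictionary (not a cross-reference stream), the catalog is stored uncompressed, and `/PageLabels` is found by a plain substring test. The function name `find_catalog` and the returned keys are made up for this sketch.

```python
import re

def find_catalog(pdf: bytes):
    """Locate the /Root reference and inspect the catalog object (naive text scan)."""
    # Point 3: the trailer names the root object as "/Root <num> <gen> R".
    # With incremental updates there may be several trailers; take the last one.
    roots = list(re.finditer(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf))
    if not roots:
        return None
    num, gen = int(roots[-1].group(1)), int(roots[-1].group(2))

    # Points 1 and 2: find "<num> <gen> obj ... endobj" and check /Type /Catalog.
    pattern = rb"(?<!\d)%d\s+%d\s+obj(.*?)endobj" % (num, gen)
    bodies = list(re.finditer(pattern, pdf, re.DOTALL))
    if not bodies:
        return None  # probably stored inside a compressed object stream
    body = bodies[-1].group(1)

    return {
        "object": (num, gen),
        "is_catalog": re.search(rb"/Type\s*/Catalog", body) is not None,
        # Point 4: custom page labels require a /PageLabels entry.
        "has_page_labels": b"/PageLabels" in body,
    }
```

If this returns `None` for a real file, that does not mean the file has no labels; it usually means the assumptions above do not hold and the objects need to be expanded first.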

This is where it stops being relatively easy, because the object that the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.
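
To give an idea of what "expanding" involves, here is a minimal Python sketch that splits one object stream into its objects. It assumes the stream data uses plain Flate (zlib) compression without predictors, and that the stream dictionary's `/N` (object count) and `/First` (offset of the first object) values have already been read and are passed in as `n` and `first`. The function name `split_object_stream` is hypothetical.

```python
import zlib

def split_object_stream(raw: bytes, n: int, first: int) -> dict:
    """Inflate an object stream and return {object_number: object_bytes}."""
    data = zlib.decompress(raw)

    # The decoded stream starts with n pairs "object_number offset",
    # where each offset is relative to the byte position `first`.
    header = data[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]

    objects = {}
    for i, (number, offset) in enumerate(pairs):
        # An object ends where the next one starts (or at the end of the data).
        end = pairs[i + 1][1] if i + 1 < n else len(data) - first
        objects[number] = data[first + offset:first + end].strip()
    return objects
```

In practice a tool such as qpdf does this (and the cross-reference bookkeeping around it) for you; the sketch only shows why the page label dictionary cannot be found by searching the raw file for text.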

Once you have succeeded in getting the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.
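
A number tree may keep its key/value pairs directly in a `/Nums` array, or spread them over child nodes listed under `/Kids`. The following Python sketch flattens such a tree, assuming it has already been parsed into nested Python dicts and lists (the parsing itself is not shown, and the dict layout is an assumption of this sketch, not a library format).

```python
def flatten_number_tree(node: dict) -> list:
    """Return all (key, value) pairs of a number tree, sorted by key."""
    pairs = []

    # A /Nums array alternates keys and values: [k1, v1, k2, v2, ...].
    nums = node.get("Nums", [])
    pairs.extend(zip(nums[0::2], nums[1::2]))

    # Intermediate nodes hold no pairs themselves, only further nodes.
    for kid in node.get("Kids", []):
        pairs.extend(flatten_number_tree(kid))

    return sorted(pairs, key=lambda pair: pair[0])
```

For page labels, each key is a zero-based page index at which a new labeling range starts, and each value is the dictionary describing that range.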

I won't go into the details of these complexities, because it would take a very long article to describe all possible variations. You had better read up on them directly in the official ISO PDF-1.7 specification.

Instead, I'll give you an example in ASCII PDF code:

213 0 obj
  << /Type /Catalog
     /PageLabels 
        << 
           /Nums 
                 [ 
                   0 <<           % start labeling from page no. 1
                       /S /r      % label with lowercase roman numbers
                     >> 
                   7 <<           % start new labeling from page no. 8
                       /S /D      % label with standard decimal numbers
                     >> 
                   11 <<          % start labeling page no. 12
                       /S /D      % label with decimal numbers...
                       /P (ABCD-) %   ...but using label prefix 'ABCD-'...
                       /St 3      %   ...followed by '3' as the start decimal.
                     >>
                  ]
        >>
     %%...........................
     %%...more root object keys...
     %%........................... 
  >>
endobj

The above example will label pages number 1, 2, 3, ... (last) like this:

i
ii
iii
iv
v
vi
vii
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...
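
The labeling rules used in the example can be sketched in Python as follows. This is an illustration, not a complete implementation: it takes the ranges as already-flattened `(start_index, rule)` pairs with plain Python dicts, and the function names are made up. The styles handled are `/D` (decimal), `/r` and `/R` (roman), `/a` and `/A` (letters); a range without `/S` gets the prefix only. Falling back to plain decimal page numbers when no range applies is an assumption of this sketch.

```python
def to_roman(n: int) -> str:
    """Lowercase roman numeral for n >= 1."""
    table = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in table:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def to_letters(n: int) -> str:
    """a..z for 1..26, aa..zz for 27..52, and so on."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)

def format_number(style, n: int) -> str:
    if style == "D":
        return str(n)
    if style == "r":
        return to_roman(n)
    if style == "R":
        return to_roman(n).upper()
    if style == "a":
        return to_letters(n)
    if style == "A":
        return to_letters(n).upper()
    return ""  # no /S entry: the label consists of the prefix only

def expand_page_labels(ranges: list, page_count: int) -> list:
    """ranges: [(zero-based start index, {"S":..., "P":..., "St":...}), ...]"""
    ranges = sorted(ranges, key=lambda r: r[0])
    labels = []
    for index in range(page_count):
        current = None
        for start, rule in ranges:
            if start <= index:
                current = (start, rule)  # last range starting at or before this page
        if current is None:
            labels.append(str(index + 1))  # assumption: fall back to decimal
            continue
        start, rule = current
        number = rule.get("St", 1) + (index - start)
        labels.append(rule.get("P", "") + format_number(rule.get("S"), number))
    return labels

ranges = [(0, {"S": "r"}), (7, {"S": "D"}), (11, {"S": "D", "P": "ABCD-", "St": 3})]
print(expand_page_labels(ranges, 14))
```

For a 14-page document this prints `['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', '1', '2', '3', '4', 'ABCD-3', 'ABCD-4', 'ABCD-5']`. Note that the first range covers indices 0 to 6, which is seven pages, so the roman numerals run up to vii.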

As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
