How to extract page numbers from a PDF file


Problem description

We explored many APIs, such as Tika, PDFBox, and itextpdf, to extract page numbers from a PDF file, but we were not able to do it. In itextpdf we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.

Recommended answer

The reason why you can't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.

Allow me to predict your reply.

"Wait a minute!" you say, "when I open the PDF in Adobe Reader, I can clearly see the page numbers in the document!"

Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs, lines, and shapes on a page are about. Hence, software cannot give you the page number you see as a human. A machine doesn't know where to look!

If you know something about PDF, I can predict your next reply.

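"Wait a minute!" you say, "what about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with its presentation?"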

Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, a paragraph, a list, and so on. But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content. They are marked as artifacts, along with headers, footers, and other items on a page that are not considered real content. There is no way to distinguish page numbers.

"So what about these page labels?" you ask.

Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
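To make the "not uniform" behaviour more concrete, here is a rough sketch of how a viewer might expand page label ranges into per-page labels. It follows my reading of the page label dictionary in the PDF specification (a numbering style, an optional prefix, and a start value per range), and it uses only the Java standard library. The names `PageLabelSketch`, `LabelRange`, and `labels` are made up for illustration; this is not how iText or any other library implements it, and the fallback for a page with no label range is my own assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: expands page-label ranges into one label per page. */
class PageLabelSketch {

    /** One labelling range. style is "D", "R", "r", "A", "a", or null (prefix only). */
    static final class LabelRange {
        final String style;
        final String prefix;
        final int start;

        LabelRange(String style, String prefix, int start) {
            this.style = style;
            this.prefix = prefix == null ? "" : prefix;
            this.start = start;
        }
    }

    /** Upper-case roman numeral for n >= 1. */
    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /** Letter style: A..Z, then AA..ZZ, then AAA..ZZZ, and so on. */
    static String letters(int n) {
        char c = (char) ('A' + (n - 1) % 26);
        int count = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Formats one numeric value according to the range's style and prefix. */
    static String format(LabelRange range, int value) {
        if (range.style == null) {
            return range.prefix; // no style: the label is the prefix alone
        }
        switch (range.style) {
            case "D": return range.prefix + value;
            case "R": return range.prefix + roman(value);
            case "r": return range.prefix + roman(value).toLowerCase();
            case "A": return range.prefix + letters(value);
            case "a": return range.prefix + letters(value).toLowerCase();
            default: throw new IllegalArgumentException("Unknown style: " + range.style);
        }
    }

    /** ranges maps the zero-based index of a range's first page to its settings. */
    static List<String> labels(TreeMap<Integer, LabelRange> ranges, int pageCount) {
        List<String> out = new ArrayList<>();
        for (int page = 0; page < pageCount; page++) {
            Map.Entry<Integer, LabelRange> entry = ranges.floorEntry(page);
            if (entry == null) {
                // Assumption: with no label range, fall back to the physical position.
                out.add(String.valueOf(page + 1));
            } else {
                int value = entry.getValue().start + (page - entry.getKey());
                out.add(format(entry.getValue(), value));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Integer, LabelRange> ranges = new TreeMap<>();
        ranges.put(0, new LabelRange("r", null, 1));  // front matter in lower-case roman
        ranges.put(4, new LabelRange("D", null, 1));  // body restarts at 1
        ranges.put(7, new LabelRange("D", "A-", 8));  // appendix with a prefix
        System.out.println(labels(ranges, 9));
    }
}
```

With the three ranges in `main`, a nine-page document gets the labels i, ii, iii, iv, 1, 2, 3, A-8, A-9. With an empty map, the sketch just returns the physical positions 1 to 9. Note that even when labels are present, they are whatever the PDF producer chose to store; nothing ties them to the numbers actually drawn on the pages.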

This is the long answer. The short answer is simple: you are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).
