Retrieve page numbers from document with pyPDF

Question

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book into 20 10-page PDFs and I want to put them back together).

I have two questions - 1.) I know that sometimes the page number is stored in the document data somewhere, as I've seen PDFs that render on Adobe as something like [1243] (10 of 150), but I've read documents of this sort into pyPDF and I can't find any information indicating the page number - where is this stored?

2.) If avenue #1 isn't available, I think I could iterate through the objects on a given page to try to find a page number - likely it would be its own object that has a single number in it. However, I can't seem to find any clear way to determine the contents of objects. If I run:

pdf.getPage(0).getContents()

This usually either returns:

{'/Filter': '/FlateDecode'}

or it returns a list of IndirectObject(num, num) objects. I don't really know what to do with either of these and there's no real documentation on it as far as I can tell. Is anyone familiar with this kind of thing that could point me in the right direction?

Recommended answer

For full documentation, see Adobe's 978-page PDF Reference. :-)

More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

In PyPDF, to get at this information, try, as a starting point:

pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
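To make the mapping step concrete, here is a minimal sketch of turning a /Nums array into one label per physical page. It assumes the array has already been read into plain Python values (indirect references resolved) and follows the page-label rules as I understand them from the PDF Reference: each entry pairs a 0-based starting page index with a dictionary holding an optional style /S, prefix /P, and start value /St. The function names are my own, not part of pyPdf.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """Letter style: A..Z, then AA..ZZ, then AAA..ZZZ, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def page_labels(nums, page_count):
    """Return the logical label for each physical page (hypothetical helper).

    nums is a flat list: [start_index, label_dict, start_index, label_dict, ...]
    """
    ranges = sorted(zip(nums[0::2], nums[1::2]), key=lambda r: r[0])
    labels = []
    for page in range(page_count):
        # Pick the last range that starts at or before this page.
        current = None
        for start, spec in ranges:
            if start <= page:
                current = (start, spec)
        if current is None:
            # Assumption: fall back to plain 1-based numbering.
            labels.append(str(page + 1))
            continue
        start, spec = current
        number = spec.get("/St", 1) + (page - start)
        style = spec.get("/S")
        if style == "/D":
            text = str(number)
        elif style == "/R":
            text = to_roman(number)
        elif style == "/r":
            text = to_roman(number).lower()
        elif style == "/A":
            text = to_letters(number)
        elif style == "/a":
            text = to_letters(number).lower()
        else:
            text = ""  # no style: the label is just the prefix
        labels.append(spec.get("/P", "") + text)
    return labels


example_nums = [0, {"/S": "/r"}, 4, {"/S": "/D"},
                7, {"/S": "/D", "/P": "A-", "/St": 8}]
print(page_labels(example_nums, 10))
# ['i', 'ii', 'iii', 'iv', '1', '2', '3', 'A-8', 'A-9', 'A-10']
```

With a real file you would pass in the /Nums array and pdf.getNumPages(). Not every PDF has a /PageLabels entry, and the number tree may be split across /Kids nodes instead of a single /Nums array, so treat a missing key as "no label information" rather than an error.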

By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
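A small sketch of how that can be wrapped up so the rest of your code does not care whether it was handed a reference or the real thing. The resolve helper is my own name, and the StandInReference class exists only so the example runs without a PDF; it is not a pyPdf class. The only assumption about the library is the getObject() method mentioned above.

```python
def resolve(obj):
    """Follow getObject() until it stops returning something new.

    Works on anything: objects without getObject() are returned unchanged,
    and objects whose getObject() returns themselves end the chain.
    """
    getter = getattr(obj, "getObject", None)
    if getter is None:
        return obj
    target = getter()
    if target is obj:
        return obj
    return resolve(target)


class StandInReference(object):
    """Hypothetical stand-in for an indirect reference, for demonstration only."""

    def __init__(self, target):
        self.target = target

    def getObject(self):
        return self.target


# A list of references, like the one getContents() sometimes gives back.
contents = [StandInReference({"/Filter": "/FlateDecode"}),
            StandInReference(StandInReference("nested"))]
resolved = [resolve(item) for item in contents]
print(resolved)
```

The same list comprehension applies to the list of IndirectObject instances from the question: resolve each element, then inspect what you get.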

Your alternative is, as you say, to check the text objects and try to figure out which is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to try to fish out the page number from that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job—PyPDF is written in Python, after all—and use it as a basis of a routine that checks each text object on the page individually to see if it's like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
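As a rough illustration of the "check each text object individually" idea, the sketch below assumes you have already collected the page's text fragments into a list of strings (how you collect them depends on what you borrow from extractText()). All names are hypothetical, and the rules are only a heuristic: a fragment counts as a candidate if it is a short Arabic number or a well-formed Roman numeral, and a page with more than one candidate is treated as ambiguous, which is one simple way to avoid being fooled by TOC/index pages.

```python
import re

# Up to four digits, nothing else.
ARABIC = re.compile(r"^\d{1,4}$")
# Well-formed Roman numeral, upper or lower case, at least one character.
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def clean(fragment):
    """Strip whitespace and common decorations such as '- 42 -' or '[42]'."""
    return fragment.strip().strip("-[]()").strip()


def looks_like_page_number(fragment):
    text = clean(fragment)
    return bool(ARABIC.match(text) or ROMAN.match(text))


def guess_page_number(fragments):
    """Return the single candidate on the page, or None if none or several."""
    candidates = [clean(f) for f in fragments if looks_like_page_number(f)]
    if len(candidates) == 1:
        return candidates[0]
    return None


print(guess_page_number(["Chapter 3", "The quick brown fox", "- 42 -"]))  # 42
print(guess_page_number(["Preface", "xiv"]))                              # xiv
print(guess_page_number(["Introduction", "1", "Methods", "17"]))          # None
```

Expect false positives (a stray word like "mix" is a valid Roman numeral) and misses (alphabetic labels are not handled here). Using the fragment's position on the page, if you can recover it, would be a sensible next filter.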
