How to get a bookmark's page number


Problem description

from pyPdf import PdfFileReader
f = open('document.pdf', 'rb')
p = PdfFileReader(f)
o = p.getOutlines()

The list object o consists of dictionary-like pyPdf.pdf.Destination objects (bookmarks), which have many properties, but I can't find any page number that a bookmark refers to.

How can I get the page number of, say, the o[1] bookmark?

For example, o[1].page.idnum returns a number roughly 3 times larger than the referenced page number in the PDF document. I assume it refers to some object smaller than a page, because running .page.idnum over the whole document outline returns an array of numbers that is not even linearly correlated with the "real" destination page numbers in the PDF; it is only roughly 3 times as large.

Update: This question is the same as "split a pdf based on outline", although I don't understand what the author did in his self-answer there. It seems too complicated to me to be usable.

Recommended answer

As @theta pointed out, "split a pdf based on outline" has the code required to extract page numbers. If that feels complicated: I copied the part of the code that maps page ids to page numbers and made it a function. Here is a working example that prints the page number of bookmark o[0]:

from PyPDF2 import PdfFileReader


def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    """Map each page object's idnum to its 0-based page index."""
    if _result is None:
        _result = {}
    if pages is None:
        # first call: start at the root of the page tree
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        # intermediate node: record each kid, then recurse into it
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        # leaf page: grow the counter so the next page gets the next index
        _num_pages.append(1)
    return _result


# main
with open('document.pdf', 'rb') as f:
    p = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(p)
    o = p.getOutlines()
    # the map is 0-based, so add 1 for a human-readable page number
    pg_num = pg_id_num_map[o[0].page.idnum] + 1
    print(pg_num)

Probably too late for @theta, but it might help others :) By the way, this is my first post on Stack Overflow, so excuse me if I did not follow the usual format.
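
To illustrate why idnum is an object id rather than a page number, here is a minimal sketch of the same tree walk on plain dicts. The object table, its ids, and the name map_ids_to_page_index are all made up for illustration and are not part of PyPDF2; the sketch also records only leaf pages, which is a simplification of the function in this answer.

```python
# Stand-in for a PDF object table: object id -> dictionary.
# The ids are invented; real files interleave page objects with
# content streams, fonts and other objects, so ids skip around.
objects = {
    1: {"/Type": "/Pages", "/Kids": [3, 10]},
    3: {"/Type": "/Page"},
    10: {"/Type": "/Pages", "/Kids": [12, 17]},
    12: {"/Type": "/Page"},
    17: {"/Type": "/Page"},
}


def map_ids_to_page_index(objects, node_id, mapping=None):
    """Walk the page tree depth-first; give each leaf page a 0-based index."""
    if mapping is None:
        mapping = {}
    node = objects[node_id]
    if node["/Type"] == "/Pages":
        # intermediate node: visit children in document order
        for kid_id in node["/Kids"]:
            map_ids_to_page_index(objects, kid_id, mapping)
    elif node["/Type"] == "/Page":
        # leaf: the next free index is the number of pages seen so far
        mapping[node_id] = len(mapping)
    return mapping


page_index = map_ids_to_page_index(objects, 1)
print(page_index)  # {3: 0, 12: 1, 17: 2}
```

In this toy table the object with id 17 is the third page, which matches the observation in the question that the ids run well ahead of the page numbers.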

To extend this further: if you want the exact location of a bookmark on the page, this will make your job easier:

from PyPDF2 import PdfFileReader
import PyPDF2 as pyPdf

def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        _num_pages.append(1)
    return _result
def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
    if result is None:
        result = dict()
    if type(outlines) == list:
        for outline in outlines:
            result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
    elif type(outlines) == pyPdf.pdf.Destination:
        title = outlines['/Title']
        result[title.split()[0]] = dict(title=outlines['/Title'], top=outlines['/Top'], \
        left=outlines['/Left'], page=(pg_id_num_map[outlines.page.idnum]+1))
    return result

# main
pdf_name = 'document.pdf'
f = open(pdf_name,'rb')
pdf = PdfFileReader(f)
# map page ids to page numbers
pg_id_num_map = _setup_page_id_to_num(pdf)
outlines = pdf.getOutlines()
bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
print(bookmarks_info)

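Note: My bookmarks are section numbers (e.g. 1.1 Introduction), and I am mapping the bookmark info to the section number. If your bookmarks are different, modify this part of the code: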

    elif type(outlines) == pyPdf.pdf.Destination:
        title = outlines['/Title']
        result[title.split()[0]] = dict(title=outlines['/Title'], top=outlines['/Top'], \
        left=outlines['/Left'], page=(pg_id_num_map[outlines.page.idnum]+1))
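
As a rough sketch of what "modify this part" could look like, the key can be passed in as a function so the same walk serves both section-numbered and free-form titles. Everything here is hypothetical: collect_bookmarks and key_fn are names invented for this example, and plain dicts with made-up values stand in for Destination objects.

```python
def collect_bookmarks(outlines, key_fn, result=None):
    """Flatten a nested outline (lists within lists) into a dict keyed by key_fn(item)."""
    if result is None:
        result = {}
    if isinstance(outlines, list):
        # nested lists represent child bookmarks; recurse into them
        for item in outlines:
            collect_bookmarks(item, key_fn, result)
    else:
        result[key_fn(outlines)] = outlines
    return result


# Plain dicts stand in for Destination objects; the values are invented.
sample = [
    {"/Title": "1 Introduction", "page": 1},
    [
        {"/Title": "1.1 Scope", "page": 2},
        {"/Title": "1.2 Terms", "page": 3},
    ],
    {"/Title": "Appendix", "page": 9},
]

# key by section number (first token of the title), as in the answer
by_section = collect_bookmarks(sample, lambda d: d["/Title"].split()[0])
# key by the full title, for bookmarks without section numbers
by_title = collect_bookmarks(sample, lambda d: d["/Title"])

print(sorted(by_section))  # ['1', '1.1', '1.2', 'Appendix']
print(by_title["1.1 Scope"]["page"])  # 2
```

Whichever key is chosen, bookmarks that produce the same key overwrite each other, so the key should be unique within the document.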
