Split a PDF based on its outline


Question

I would like to use pyPdf to split a PDF file based on its outline, where each destination in the outline refers to a different page within the PDF.

Example outline:


main       --> points to page 1
  sect1    --> points to page 1
  sect2    --> points to page 15
  sect3    --> points to page 22

It is easy in pyPdf to iterate over each page of the document or over each destination in the document's outline; however, I cannot figure out how to get the page number that a destination points to.

Does anybody know how to find the page number that each destination in the outline refers to?

Recommended answer

I figured it out:

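import pyPdf


class Darrell(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):
        """Return a dict mapping outline titles to zero-based page numbers."""

        def _setup_outline_page_ids(outline, _result=None):
            # Walk the (possibly nested) outline and record the object id
            # of the page that each destination points to.
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            # Walk the page tree in document order and map each page's
            # object id to its zero-based page number.
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    # len(_num_pages) is the number of leaf pages seen so far.
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        # Note: the result is keyed by title alone, so outline entries that
        # share a title overwrite one another.
        result = {}
        for (_, title), page_idnum in outline_page_ids.items():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result


PATH_TO_PDF = 'PATH-TO-PDF'  # placeholder: replace with the path to your PDF
pdf = Darrell(open(PATH_TO_PDF, 'rb'))
template = '%-5s  %s'
print(template % ('page', 'title'))
for p, t in sorted((v, k) for k, v in pdf.getDestinationPageNumbers().items()):
    # Page numbers are zero-based; unresolved ones are the string '???'.
    print(template % (p + 1 if isinstance(p, int) else p, t))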
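
The answer stops at the page numbers, so here is one possible way to get from them to the actual split. This is only a sketch under stated assumptions: `outline_page_ranges` and `write_chunks` are hypothetical helper names, the `chunk-<n>.pdf` naming scheme is invented for illustration, and the 30-page total in the usage note is assumed because the question does not state the document length. Destinations that start on the same page (such as `main` and `sect1` in the example outline) are grouped into one chunk, and any pages before the first destination are left out.

```python
def outline_page_ranges(dest_pages, num_pages):
    """Turn {title: zero-based start page} into a list of
    (titles, first_page, last_page) chunks; pages are zero-based, inclusive."""
    # Group titles that start on the same page so they share one chunk.
    # Unresolved entries (for example the string '???') are skipped.
    by_start = {}
    for title, page in dest_pages.items():
        if isinstance(page, int) and 0 <= page < num_pages:
            by_start.setdefault(page, []).append(title)

    starts = sorted(by_start)
    ranges = []
    for i, start in enumerate(starts):
        # A chunk ends on the page before the next chunk starts,
        # or on the last page of the document.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else num_pages - 1
        ranges.append((sorted(by_start[start]), start, end))
    return ranges


def write_chunks(reader, ranges, prefix="chunk"):
    """Write each chunk to '<prefix>-<n>.pdf' (hypothetical naming scheme)."""
    import pyPdf  # imported lazily so the range logic above runs without pyPdf

    for n, (_titles, first, last) in enumerate(ranges, 1):
        writer = pyPdf.PdfFileWriter()
        for page_number in range(first, last + 1):
            writer.addPage(reader.getPage(page_number))
        with open("%s-%d.pdf" % (prefix, n), "wb") as out:
            writer.write(out)
```

With the example outline from the question and an assumed 30 pages, `outline_page_ranges({'main': 0, 'sect1': 0, 'sect2': 14, 'sect3': 21}, 30)` gives three chunks covering pages 0 to 13, 14 to 20, and 21 to 29. Passing the reader and those ranges to `write_chunks` would then write one file per chunk.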
