Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks


Question

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks and b) pagelabels (custom page numbering) both need to be retained.

With PyPDF2 and pdftk, bookmarks are retained by default when merging; with pdfrw they are not. Pagelabels are not retained by any of PyPDF2, pdftk or pdfrw.

After a lot of searching, my guess is that there is no straightforward way to do what I want. If I'm wrong, I hope someone can point me to the easy solution. If there is no easy solution, any tips on how to get this going in Python will be much appreciated!

Some sample code:

1) With PyPDF2

from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
# PdfFileReader opens a path in binary mode itself; its second argument is `strict`, not a file mode
tmp1 = PdfFileReader('file1.pdf')
tmp2 = PdfFileReader('file2.pdf')
# extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
# but PdfFileWriter or PdfFileMerger does not support writing them, from what I understand

So I don't know how to proceed from here.

2) With pdfrw (has more promise)

from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good

The page numbers of bookmarks from the 2nd file need to be offset before adding them, but when reading the outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get the page number for each label and bookmark using pdfrw. So, I'm stuck again.


Recommended answer

You need to iterate through the existing PageLabels and add them to the merged output, taking care to offset each page index entry by the number of pages already added.
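The offset step can be sketched without any PDF library by modeling each /Nums array as a flat Python list that alternates between a 0-based page index and a label dictionary. The helper names `offset_nums` and `merge_nums`, and the sample page counts, are illustrative assumptions rather than part of PyPDF4 or any other library.

```python
def offset_nums(nums, page_offset):
    """Return a copy of a flat /Nums list with every page index shifted.

    `nums` alternates between a 0-based page index and a label dictionary,
    e.g. [0, {'/S': '/r'}, 4, {'/S': '/D'}].
    """
    shifted = []
    for position, entry in enumerate(nums):
        if position % 2 == 0:
            # Even positions hold page indices: shift them by the offset
            shifted.append(entry + page_offset)
        else:
            # Odd positions hold label dictionaries: copy them unchanged
            shifted.append(dict(entry))
    return shifted


def merge_nums(documents):
    """Merge (nums, page_count) pairs into a single flat /Nums list."""
    merged = []
    page_offset = 0
    for nums, page_count in documents:
        merged.extend(offset_nums(nums, page_offset))
        # The next document starts after all pages added so far
        page_offset += page_count
    return merged


# Hypothetical inputs: file1 has 6 pages, file2 has 10 pages
file1 = ([0, {'/S': '/r'}, 4, {'/S': '/D'}], 6)
file2 = ([0, {'/S': '/D', '/St': 489}], 10)
merged = merge_nums([file1, file2])
print(merged)
```

Only the page indices move. The label dictionaries, including any /St start value, describe how the labels look and are carried over as they are.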

This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).

from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader 

# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF

import logging

def add_nums(num_entry, page_offset, nums_array):
    for num in num_entry['/Nums']:
        if isinstance(num, int):
            logging.debug("Found page number %s, offset %s: ", num, page_offset)

            # Add the physical page information
            nums_array.append(PDF.NumberObject(num+page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)

            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # /St is the first label value of the range, not a page index,
                # so it is copied unchanged (no page offset is added)
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})

            # Add the label information
            nums_array.append(number_type)

    return nums_array

def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()

    # For PageLabels information
    page_labels = []
    page_offset = 0
    nums_array = PDF.ArrayObject()

    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        # Merge the content
        merger.append(pdf_reader)
        page_count = pdf_reader.getNumPages()

        try:
            # Fetch the existing PageLabels and add them with the current offset
            old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
            add_nums(old_page_labels, page_offset, nums_array)
        except KeyError as err:
            # This input has no /PageLabels (or no /Nums) entry; skip its labels
            print("ERROR: %s" % err)

        # Always advance the offset, even when an input has no labels,
        # so the page indices of later inputs stay correct
        page_offset = page_offset + page_count


    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})

    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})

    root_obj = merger.output._root_object
    root_obj.update(page_labels)

    # Write output
    merger.write('merged.pdf')


pdf_readers = []
# The second positional argument of PdfFileReader is `strict`, not a file mode
tmp1 = PdfFileReader('file1.pdf')
tmp2 = PdfFileReader('file2.pdf')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)

write_merged(pdf_readers)
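To sanity-check a merged /Nums array, it can help to expand it into the label each page should display. The following is a minimal sketch under stated assumptions: the array is given as a flat, sorted Python list starting at page index 0, the style names follow the PDF page-label conventions (/D decimal, /r and /R roman, /a and /A alphabetic, /P prefix, /St start value), and the helper names `to_roman`, `to_letters` and `expand_labels` are hypothetical, not part of PyPDF4.

```python
def to_roman(number):
    """Convert a positive integer to lowercase roman numerals."""
    numerals = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'),
                (100, 'c'), (90, 'xc'), (50, 'l'), (40, 'xl'),
                (10, 'x'), (9, 'ix'), (5, 'v'), (4, 'iv'), (1, 'i')]
    result = ''
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


def to_letters(number):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    letter = chr(ord('a') + (number - 1) % 26)
    return letter * ((number - 1) // 26 + 1)


def expand_labels(nums, page_count):
    """Return the display label of every page for a flat /Nums list."""
    # Pair up (start page index, label dictionary)
    ranges = [(nums[i], nums[i + 1]) for i in range(0, len(nums), 2)]
    labels = []
    for page in range(page_count):
        # The applicable range is the last one starting at or before this page
        start, style = [r for r in ranges if r[0] <= page][-1]
        value = style.get('/St', 1) + (page - start)
        kind = style.get('/S')
        if kind == '/D':
            text = str(value)
        elif kind == '/r':
            text = to_roman(value)
        elif kind == '/R':
            text = to_roman(value).upper()
        elif kind == '/a':
            text = to_letters(value)
        elif kind == '/A':
            text = to_letters(value).upper()
        else:
            # No /S entry: the label consists of the prefix only
            text = ''
        labels.append(style.get('/P', '') + text)
    return labels


# Hypothetical merged array: 6 pages from file1, then the first 2 of file2
merged_nums = [0, {'/S': '/r'}, 4, {'/S': '/D'}, 6, {'/S': '/D', '/St': 489}]
labels = expand_labels(merged_nums, 8)
print(labels)
```

Comparing this list with the labels a PDF viewer shows for merged.pdf is a quick way to spot a wrong page offset.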

Note: PyPDF2 produces this weird error:

  ...
  ...
  File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
    raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject
