pypdf: Merging multiple PDF files into one PDF

Question

I have 1000+ PDF files that need to be merged into one PDF:

output = PdfFileWriter()
for filename in filenames:  # filename0000 ... filename1000
    # Each reader keeps its source file open.
    input = PdfFileReader(file(filename, "rb"))
    pageCount = input.getNumPages()
    for iPage in range(0, pageCount):
        output.addPage(input.getPage(iPage))
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()

When the code above reaches input = PdfFileReader(file(filename500+, "rb")), it fails with this error:

IOError: [Errno 24] Too many open files

I think this is a bug. If not, what should I do?

Recommended answer

I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.

Note: I am assuming that filename is a well-formed file path string. Assume the same throughout my code.

Short answer

Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to keep the following as close to your code as I could:

from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))

merger.write("document-output.pdf")

Long answer

The way you're using PdfFileReader and PdfFileWriter keeps each file open, which eventually causes Python to raise IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding a reference to the page in the open PdfFileReader (hence the noted IO error if you close the file). Python sees that each file is still referenced, so it does no garbage collection or automatic file closing, even though the input variable is rebound on every iteration. The files remain open until PdfFileWriter no longer needs access to them, which is at output.write(outputStream) in your code.
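To make the mechanism concrete, here is a toy model in Python 3 that uses only the standard library. It is not PyPDF2 code: ToyPage and collect_pages are hypothetical names standing in for a lazily read page and for the loop in the question. It only illustrates that a handle stays open for as long as something still references its stream.

```python
import os
import shutil
import tempfile


class ToyPage:
    """Hypothetical stand-in for a lazily read page: it keeps its source stream."""

    def __init__(self, stream):
        self.stream = stream


def collect_pages(paths):
    """Open every path and keep one page per file, as the question's loop does."""
    pages = []
    for path in paths:
        stream = open(path, "rb")      # never closed inside the loop
        pages.append(ToyPage(stream))  # the page keeps the stream reachable
    return pages


# Create a few small input files in a temporary directory.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(5):
    path = os.path.join(tmpdir, "toy%d.bin" % i)
    with open(path, "wb") as fh:
        fh.write(b"data")
    paths.append(path)

pages = collect_pages(paths)
open_count = sum(1 for page in pages if not page.stream.closed)
print(open_count)  # 5: one open handle per input file

# Clean up: close the handles explicitly, then delete the files.
for page in pages:
    page.stream.close()
shutil.rmtree(tmpdir)
```

With 1000+ inputs, the same pattern exceeds the per-process limit on open files, which is what Errno 24 reports.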

To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.

My initial attempts looked like the following, and were resulting in the same IO Problems:

merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)

merger.write(output_file_path)

Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):

    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()   
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True

We can see that append() does accept a string, and when it gets one, it assumes the string is a file path and creates a file object at that location. The end result is the exact same thing we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!

However, if we turn the path string into either a file object or a PdfFileReader object (see Edit 2) before it gets passed into append(), merge() will automatically create a copy for us as a StringIO object, allowing Python to close the file.
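The difference between the two input types can be sketched in Python 3 with the standard library alone. This is an approximation of the two branches discussed above, not PyPDF2's actual code: prepare_source is a hypothetical helper, io.BytesIO takes the place of the Python 2 StringIO, and the PdfFileReader branch is left out.

```python
import io
import os
import tempfile


def prepare_source(source):
    """Return (stream, copied), mimicking the path and file-object branches."""
    if isinstance(source, str):
        # Path string: a new handle is opened and stays open for the caller.
        return open(source, "rb"), False
    # File object: copy the bytes into memory so the original can be closed.
    source.seek(0)
    return io.BytesIO(source.read()), True


# Write a small placeholder file to try both branches on.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"example bytes")

from_path, path_copied = prepare_source(demo_path)
with open(demo_path, "rb") as fh:
    from_file, file_copied = prepare_source(fh)

print(from_path.closed, path_copied)  # False False: the handle is still open
print(fh.closed, file_copied)         # True True: only the in-memory copy remains

from_path.close()
os.remove(demo_path)
```

The path branch leaves one operating-system handle open per input, while the file-object branch leaves none, at the cost of holding the file's bytes in memory.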

I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().

Hope this helps!

EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained, and its author has given his official blessing to Phaseit in developing PyPDF2.

If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
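A minimal sketch of that copy step, assuming Python 3, where io.BytesIO plays the role that StringIO plays for binary data in Python 2. load_into_memory is a hypothetical helper name, and the placeholder bytes are not a valid PDF. The PdfFileReader call appears only in a comment because the library is not imported here.

```python
import io
import os
import tempfile


def load_into_memory(path):
    """Return an in-memory copy of the file; the handle is closed on return."""
    with open(path, "rb") as fh:
        return io.BytesIO(fh.read())


# Write placeholder bytes to a temporary file (not a real PDF).
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as fh:
    fh.write(b"%PDF-1.4 placeholder bytes, not a real PDF")

buffer = load_into_memory(demo_path)
os.remove(demo_path)  # the copy no longer depends on the file on disk
first_bytes = buffer.read(8)
buffer.seek(0)  # rewind so a later reader starts from the beginning

# In the question's loop, the buffer would replace the file object, e.g.:
# input = PdfFileReader(load_into_memory(filename))
```

The trade-off is memory: every input stays in RAM until the output is written, so the peak usage is roughly the combined size of all input files.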

EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).
