Check whether a PDF file is valid with Python

Question

I receive a file via an HTTP upload and need to be sure it is a PDF file. The programming language is Python, but this should not matter.

I thought of the following solutions:

  1. Check whether the first bytes of the file are "%PDF". This is not a very good check, but it prevents users from accidentally uploading other files.
  2. Try libmagic (the "file" command in the shell uses it). This does exactly the same check as in (1).
  3. Take a library and try to read the page count from the file. If the library can read a page count, the file should be a valid PDF. Problem: I don't know of a Python library that can do this.
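Option (1) can be sketched with the standard library alone. This is a minimal illustration, not a full validity check: the helper name looks_like_pdf is hypothetical, it assumes the upload is available as a seekable binary stream, and it compares against "%PDF-" (the marker that precedes the version number in a PDF header) rather than the bare "%PDF" mentioned above.

```python
import io

PDF_MAGIC = b"%PDF-"  # a PDF header looks like "%PDF-1.7"


def looks_like_pdf(stream):
    """Return True if the binary stream starts with the PDF header marker.

    Hypothetical helper: assumes `stream` is seekable and opened in binary mode.
    """
    start = stream.tell()
    try:
        head = stream.read(len(PDF_MAGIC))
    finally:
        stream.seek(start)  # leave the stream where we found it
    return head == PDF_MAGIC


if __name__ == "__main__":
    print(looks_like_pdf(io.BytesIO(b"%PDF-1.4\n...")))
    print(looks_like_pdf(io.BytesIO(b"GIF89a")))
```

As the question notes, this only guards against accidental uploads of other file types; a file with the right first bytes can still be truncated or corrupt.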

So does anybody have a solution, either a library or another trick?

Recommended answer

The two most commonly used PDF libraries for Python are:

  • pyPdf
  • ReportLab

Both are pure Python, so they should be easy to install and should work across platforms.

With pyPdf it would probably be as simple as:

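from pyPdf import PdfFileReader
# open() replaces the Python 2-only file() builtin; the constructor raises if the file cannot be parsed
doc = PdfFileReader(open("upload.pdf", "rb"))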

This should be enough: if the constructor does not raise an exception, the file could be parsed. For further checking, doc now offers the getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties).

As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs because of the system overhead of forking a new process.
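The pdftotext approach could look roughly like the sketch below. It assumes the pdftotext command-line tool from Poppler or Xpdf is installed and on the PATH, and that it exits with a non-zero status when it cannot open the file. The function name pdf_parses_with_tool, the tool parameter, and the 30-second timeout are illustrative choices, not part of the original answer.

```python
import shutil
import subprocess


def pdf_parses_with_tool(path, tool="pdftotext", timeout=30):
    """Return True/False based on the tool's exit status.

    Returns None when the external tool is not installed, because the
    check cannot be performed in that case.
    """
    executable = shutil.which(tool)
    if executable is None:
        return None
    try:
        result = subprocess.run(
            [executable, path, "-"],     # "-" sends the extracted text to stdout
            stdout=subprocess.DEVNULL,   # the text itself is not needed
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failed check
    return result.returncode == 0


if __name__ == "__main__":
    print(pdf_parses_with_tool("upload.pdf"))
```

Returning None for a missing tool keeps "could not check" separate from "checked and failed", so the caller can fall back to a library-based check.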
