How to extract PDF fields from a filled out form in Python?

Question

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it didn't dump any of the filled out data.
  • pyPdf: it maxed out a core for 2 minutes when I tried to load the file with PdfFileReader(f), so I gave up and killed it.
  • Jython and PDFBox: I got that working great, but the startup time is excessive; I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.

Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1, PDFObjRef

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        return [load_fields(resolve1(f)) for f in
                   resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
                    help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
                      default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
                      help='Format output for python consumption')
    return parser.parse_args()

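def main():
    """Dump the form contents to a file or to stdout"""
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        # Pickle data is binary, so open the file in a matching mode
        with open(args.out, 'wb' if args.pickle else 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                outfile.write(pp.pformat(form))
    else:
        if args.pickle:
            # On Python 3 this prints the repr of the pickled bytes
            print(pickle.dumps(form))
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)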

if __name__ == '__main__':
    main()
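
The `decode('utf-16')` call in `load_fields` fits field names stored as UTF-16 with a byte order mark, which is one of the two encodings PDF allows for text strings; the other is PDFDocEncoding, a single-byte encoding. Below is a minimal sketch of a more tolerant decoder. The helper name `decode_field_name` is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding (the two agree on ASCII but not on every byte).

```python
def decode_field_name(raw):
    """Decode a PDF text string given as bytes (hypothetical helper)."""
    if raw is None:
        # Some fields carry no partial name at all
        return None
    if isinstance(raw, str):
        # Already decoded, nothing to do
        return raw
    if raw.startswith(b'\xfe\xff'):
        # UTF-16BE text strings start with a byte order mark
        return raw[2:].decode('utf-16-be', errors='replace')
    # Assumption: Latin-1 stands in for PDFDocEncoding here
    return raw.decode('latin-1')


print(decode_field_name(b'\xfe\xff\x00N\x00a\x00m\x00e'))
print(decode_field_name(b'Name'))
```

In the script above, this helper could presumably replace the `.decode('utf-16')` call on `field.get('T')`, though that substitution has not been tested against real forms.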

Recommended answer

You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge of the PDF format (with respect to forms of course, but also PDF's internal structures like "dictionaries" and "indirect objects").
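
To make "dictionaries" and "indirect objects" concrete, here is a toy model in plain Python. It is not pdfminer's implementation: the `Ref` class, the `OBJECTS` table, and the `resolve` function are all hypothetical stand-ins, meant only to show why a resolving step (the role `resolve1` plays in pdfminer code) is needed before a dictionary's contents can be read.

```python
class Ref:
    """Stand-in for an indirect reference such as '12 0 R' (hypothetical)."""
    def __init__(self, objid):
        self.objid = objid


# Toy object table: object id -> object. Values may themselves be references.
OBJECTS = {
    1: {'AcroForm': Ref(2)},
    2: {'Fields': [Ref(3), Ref(4)]},
    3: {'T': b'first_name', 'V': b'Ada'},
    4: {'T': b'signed', 'V': Ref(5)},
    5: b'yes',
}


def resolve(obj):
    """Follow references until a direct object is reached."""
    while isinstance(obj, Ref):
        obj = OBJECTS[obj.objid]
    return obj


catalog = resolve(Ref(1))
acroform = resolve(catalog['AcroForm'])
values = {}
for item in acroform['Fields']:
    field = resolve(item)  # each entry in Fields is a reference
    values[field['T'].decode('ascii')] = resolve(field['V'])
print(values)
```

Note that resolving is needed at several levels: the form dictionary, each field, and sometimes the field value itself.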

This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]

# Keep the file open while pdfminer reads objects from it
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # The AcroForm dictionary lists the top-level form fields
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))
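
For forms with nested fields, the PDF format builds a fully qualified field name by joining the partial names (`T`) along the path with periods, and a field without its own `V` inherits the value from its parent. The sketch below walks such a tree and returns a flat dictionary. It runs on plain Python dicts so it can be tried without a PDF; `flatten_fields` and its `resolve` parameter are hypothetical names, and decoding names as Latin-1 is a simplifying assumption.

```python
def flatten_fields(fields, resolve=lambda obj: obj, prefix='', inherited=None):
    """Return {qualified_name: value} for a tree of field dictionaries."""
    result = {}
    for item in fields:
        field = resolve(item)
        partial = field.get('T')
        if isinstance(partial, bytes):
            partial = partial.decode('latin-1')  # assumption: simple names
        if partial is None:
            name = prefix  # e.g. a widget that only carries appearance data
        elif prefix:
            name = prefix + '.' + partial
        else:
            name = partial
        # V is inheritable: fall back to the parent's value when absent
        value = resolve(field.get('V', inherited))
        kids = resolve(field.get('Kids'))
        if kids:
            result.update(flatten_fields(kids, resolve, name, value))
        else:
            result[name] = value
    return result


sample = [
    {'T': b'name', 'V': b'Ada'},
    {'T': b'address', 'Kids': [
        {'T': b'city', 'V': b'London'},
        {'T': b'zip', 'V': b'N1'},
    ]},
]
print(flatten_fields(sample))
```

With pdfminer, one would presumably pass `resolve1` as the `resolve` argument together with the `fields` list from the example above; that combination is an assumption and has not been verified here.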

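Forgot to mention: if you need to provide a password, pass it to doc.initialize(). In newer pdfminer releases, including pdfminer.six, the password is instead given as the password argument of the PDFDocument constructor.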
