How to extract PDF fields from a filled out form in Python?


Problem description

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it didn't dump any of the filled out data.
  • pyPdf: it maxed a core for 2 minutes when I tried to load the file with PdfFileReader(f) and I just gave up and killed it.
  • Jython and PDFBox: got that working great, but the startup time is excessive; I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.


Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

from argparse import ArgumentParser
import pickle
import pprint
import sys

# Import layout of current pdfminer / pdfminer.six releases. Older releases
# exported PDFDocument from pdfminer.pdfparser and required explicit
# set_document(), set_parser() and initialize() calls.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        return [load_fields(resolve1(f)) for f in
                resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving.
        # The decode call assumes the field name is stored as UTF-16.
        return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
                        help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
                        default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
                        help='Format output for python consumption')
    return parser.parse_args()

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        # Pickle data is binary; pretty-printed output is text
        with open(args.out, 'wb' if args.pickle else 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                outfile.write(pp.pformat(form))
    else:
        if args.pickle:
            # Write the raw pickle bytes rather than their repr
            sys.stdout.buffer.write(pickle.dumps(form))
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__':
    main()
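
Because load_form returns a nested list (one inner list per field that has 'Kids'), it can be handy to flatten the result before looking values up by name. The following is only a minimal sketch: the flatten helper and the sample data are hypothetical and not part of the original script, and the sample merely imitates the shape of load_form's output.

```python
def flatten(form):
    """Yield (name, value) tuples from a nested list shaped like load_form() output."""
    for item in form:
        if isinstance(item, list):
            # A list stands for a field with 'Kids'; descend into it
            yield from flatten(item)
        else:
            # A tuple is a terminal (name, value) pair
            yield item

# Hypothetical sample data imitating what load_form() might return
sample = [('name', b'Alice'), [('street', b'1 Main St'), [('zip', b'12345')]]]
flat = dict(flatten(sample))
print(flat)
```

Note that flattening this way discards the parent/child grouping, so two children with the same partial name in different groups would overwrite each other in the dict.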

Solution

You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge about the pdf format (wrt forms of course, but also about pdf's internal structures like "dictionaries" and "indirect objects").

This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]

# Keep the file open while the fields are read
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # The AcroForm entry may be an indirect object, so resolve it first
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        # 'T' holds the field's partial name, 'V' its value
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))

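EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize() in older pdfminer releases; with the PDFDocument(parser) form used above, pass it as the second argument instead, i.e. PDFDocument(parser, password).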
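
The answer notes that its example only covers simple cases without nested fields. As a rough sketch of how nesting could be handled: in a PDF form, a field's full name is built by joining the partial names ('T') along the 'Kids' hierarchy with periods. The code below runs on plain dicts that stand in for resolved PDF dictionaries; walk_fields, decode_pdf_text and the sample data are hypothetical names introduced for illustration, and the Latin-1 fallback is only an approximation of PDFDocEncoding. Inherited values and other corner cases are ignored.

```python
def decode_pdf_text(raw):
    """Decode a PDF text string: UTF-16BE if it starts with the FE FF mark,
    otherwise Latin-1 (an approximation that is exact for ASCII names)."""
    if isinstance(raw, str):
        return raw
    if raw.startswith(b'\xfe\xff'):
        return raw[2:].decode('utf-16-be')
    return raw.decode('latin-1')

def walk_fields(fields, resolve=lambda obj: obj, prefix=''):
    """Yield (qualified_name, value) for every terminal field."""
    for ref in fields:
        field = resolve(ref)
        name = prefix
        partial = field.get('T')
        if partial is not None:
            part = decode_pdf_text(partial)
            name = prefix + '.' + part if prefix else part
        kids = [resolve(k) for k in resolve(field.get('Kids') or [])]
        if any('T' in kid for kid in kids):
            # Kids that carry their own name are child fields: recurse
            yield from walk_fields(kids, resolve, name)
        else:
            # No named kids (none at all, or only widgets): terminal field
            yield name, resolve(field.get('V'))

# Hypothetical stand-in for resolved field dictionaries
sample = [
    {'T': b'name', 'V': b'Alice'},
    {'T': b'address', 'Kids': [
        {'T': b'city', 'V': b'Paris'},
        {'T': b'\xfe\xff\x00z\x00i\x00p', 'V': b'75001'},
    ]},
]
result = list(walk_fields(sample))
print(result)
```

With pdfminer, the same walker could presumably be called as walk_fields(resolve1(doc.catalog['AcroForm'])['Fields'], resolve1), so that indirect references are followed at each level; that call is an untested assumption rather than something taken from the answer.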
