pyPdf for IndirectObject extraction


Question

Following this example, I can list all the objects in a PDF file:

import pyPdf

# Open in binary mode; PDF files are not text.
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages)  # Walk every page so that pyPdf resolves the objects.
print pdf.resolvedObjects

Now I need to extract a non-standard object from the PDF file.

My object is the one named MYOBJECT, and it is a string.

The part of the Python script's output that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The PDF file looks like this:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the reference to object 584 in order to get at my string (using pyPdf, of course)?

Recommended answer

Each element in pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing it on its own, or poke at it with help() and dir() at a Python prompt to learn more about how to get the string you want.

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and its value can be retrieved via getData().
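To make that path concrete, here is a minimal sketch that mimics the nesting with plain dictionaries, so it runs without pyPdf or the PDF itself. FakeStream is a hypothetical stand-in that imitates only the getData() method, and the layout (generation number, then object number, then the page dictionary) is assumed from the lookup above and the PDF excerpt in the question. With a real file, the same chain of lookups would be applied to pdf.resolvedObjects.

```python
class FakeStream(object):
    """Hypothetical stand-in for a pyPdf stream object; only getData() is imitated."""

    def __init__(self, data):
        self._data = data

    def getData(self):
        return self._data


# Assumed layout: {generation number: {object number: object}}.
resolved_objects = {
    0: {
        558: {
            '/Type': '/Page',
            '/Resources': {
                '/Properties': {
                    # In the real file this entry is the reference "584 0 R".
                    '/MC0': {'/MYOBJECT': FakeStream('\n1_22_4_1\n')},
                },
            },
        },
    },
}

# Walk down from the page object (558) to the stream (584) one key at a time.
page = resolved_objects[0][558]
stream = page['/Resources']['/Properties']['/MC0']['/MYOBJECT']

# Stream data may carry surrounding line breaks, so strip them off.
value = stream.getData().strip()
print(value)
```

Stepping through one key at a time like this also makes it easy to print the intermediate dictionaries and see where a lookup goes wrong.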

The following function gives a more generic way to solve this, by recursively searching for the key in question:

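import pyPdf

# Open in binary mode; PDF files are not text.
pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
pages = list(pdf.pages)  # Walk every page so that pyPdf resolves the objects.

def findInDict(needle, haystack):
    """Return the first value stored under needle in nested dictionaries, or None."""
    for key in haystack.keys():
        try:
            # Looking up a value may involve resolving a reference, which can fail.
            value = haystack[key]
        except Exception:
            continue
        if key == needle:
            return value
        # Recurse into plain dicts and pyPdf dictionary objects alike.
        if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()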
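One possible variation on findInDict, sketched here against plain dicts and lists only: it also descends into lists and remembers which containers it has already visited, so a structure that refers back to itself cannot cause endless recursion. The name find_key and the sample data are made up for illustration, and this has not been tried against pyPdf objects, so treat it as a starting point rather than a drop-in replacement.

```python
def find_key(needle, haystack, _seen=None):
    """Depth-first search for needle in nested dicts and lists; None if absent."""
    if _seen is None:
        _seen = set()

    if isinstance(haystack, dict):
        children = [haystack[key] for key in haystack]
    elif isinstance(haystack, (list, tuple)):
        children = list(haystack)
    else:
        return None  # Scalars cannot contain the key.

    # Skip containers that were already visited (guards against cycles).
    if id(haystack) in _seen:
        return None
    _seen.add(id(haystack))

    # Unlike findInDict, keys at the current level win over nested ones.
    if isinstance(haystack, dict) and needle in haystack:
        return haystack[needle]

    for child in children:
        found = find_key(needle, child, _seen)
        if found is not None:
            return found
    return None


# Made-up sample data: the key sits inside a list, and the page refers to itself.
document = {0: {558: {'/Kids': [{'/Name': 'first'}, {'/MYOBJECT': '1_22_4_1'}]}}}
document[0][558]['/Self'] = document[0][558]  # deliberate cycle

result = find_key('/MYOBJECT', document)
missing = find_key('/NOPE', document)
print(result)
```

Returning None for a missing key means the caller should check the result before calling getData() on it.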
