Parsing a PDF with no /Root object using PDFMiner

Problem description

I'm trying to extract text from a large number of PDFs using PDFMiner from Python. The module I wrote works for many PDFs, but for a subset of them I get this somewhat cryptic error:

IPython stack trace:

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked whether these PDFs were corrupted, but they can be read just fine.

Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.

Thanks very much!

I tried PyPDF in an attempt to get some differential diagnostics. The stack trace is below:

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF marker. The pyPdf trace above would seem to suggest otherwise, but I'm very much clueless. Any thoughts?

Recommended answer

Interesting problem. I did some research.

The function that parses the PDF (from the PDFMiner source code):

def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return

If the problem were with EOF, a different exception would be raised. Here is another function from the source:

def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return

From the wiki (PDF specification): A PDF file consists primarily of objects, of which there are eight types:

Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object

Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
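
To make the xref table description concrete, here is a minimal sketch in Python 3 of reading one classic (pre-1.5) xref section. It mirrors the logic of the `load` function quoted above but is not PDFMiner's code; the function name `parse_xref_section` and the sample data are made up for illustration, and the sketch assumes well-formed input.

```python
def parse_xref_section(lines):
    """Parse the lines that follow the 'xref' keyword, up to 'trailer'.

    Returns {object_number: (generation, byte_offset)} for in-use entries.
    Hypothetical helper for illustration; assumes well-formed input.
    """
    offsets = {}
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if line.startswith("trailer"):
            break
        # Subsection header: "<first object number> <entry count>"
        start, count = (int(x) for x in line.split())
        for objid in range(start, start + count):
            # Entry: "<byte offset> <generation> <n|f>"
            pos, genno, use = next(it).split()
            if use == "n":  # 'n' = in use, 'f' = free
                offsets[objid] = (int(genno), int(pos))
    return offsets


# Invented sample: three entries starting at object 0, the first one free.
sample = """\
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
trailer
""".splitlines()

print(parse_xref_section(sample))
```

Each in-use entry maps an object number to the byte offset where that object starts, which is what lets a reader jump straight to the object named by /Root in the trailer.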

I think the problem is that your "damaged" PDF contains a few 'root elements'.

Possible solution:

You can download the source and add print statements at each place where xref objects are retrieved and where the parser tries to parse those objects. That should let you reconstruct the full sequence of events leading up to this error.
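
As a lighter-weight first step before patching the library, the file's raw bytes can be scanned for the structural keywords the parser looks for. This is a different technique from instrumenting PDFMiner, and only a rough sketch in Python 3: the names `scan_pdf_markers` and `report` are hypothetical, and a plain byte search can produce false hits inside stream data. A file that uses cross-reference streams (PDF 1.5 and later) may also have no `trailer` keyword at all.

```python
import re

# Keywords that mark the cross-reference and trailer structure of a PDF.
MARKERS = (b"startxref", b"trailer", b"/Root", b"%%EOF")


def scan_pdf_markers(data):
    """Return {marker: [byte offsets]} for each keyword found in data."""
    return {
        m: [hit.start() for hit in re.finditer(re.escape(m), data)]
        for m in MARKERS
    }


def report(path):
    """Print how often each marker occurs and what follows the last %%EOF."""
    with open(path, "rb") as fh:
        data = fh.read()
    found = scan_pdf_markers(data)
    for marker, offsets in found.items():
        print(marker.decode("ascii"), len(offsets), offsets[:10])
    eofs = found[b"%%EOF"]
    if eofs:
        # Trailing bytes after the final %%EOF can confuse strict readers.
        trailing = len(data) - (eofs[-1] + len(b"%%EOF"))
        print("bytes after last %%EOF:", trailing)
    else:
        print("no %%EOF marker found")
```

Calling `report()` on one PDF that parses and one that fails, then comparing the counts and offsets, may show whether the failing files lack a trailer, carry several of them, or have extra data after the final `%%EOF`.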

P.S. I think this is some kind of bug in the product.
