Parsing a PDF with no /Root object using PDFMiner
Question
I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but for a subset of them I get this somewhat cryptic error:
IPython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.
Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.
Thanks very much!
I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. This would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
Recommended answer
Interesting problem. I did some research.
Here is the function that parses the PDF (from PDFMiner's source code):
def set_parser(self, parser):
    "Set the document to use a given PDFParser object."
    if self._parser: return
    self._parser = parser
    # Retrieve the information of each header that was appended
    # (maybe multiple times) at the end of the document.
    self.xrefs = parser.read_xref()
    for xref in self.xrefs:
        trailer = xref.get_trailer()
        if not trailer: continue
        # If there's an encryption info, remember it.
        if 'Encrypt' in trailer:
            #assert not self.encryption
            self.encryption = (list_value(trailer['ID']),
                               dict_value(trailer['Encrypt']))
        if 'Info' in trailer:
            self.info.append(dict_value(trailer['Info']))
        if 'Root' in trailer:
            # Every PDF file must have exactly one /Root dictionary.
            self.catalog = dict_value(trailer['Root'])
            break
    else:
        raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    if self.catalog.get('Type') is not LITERAL_CATALOG:
        if STRICT:
            raise PDFSyntaxError('Catalog not found!')
    return
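The `for ... else` in `set_parser` is what produces the error: the `else` branch runs only when the loop finishes without hitting `break`, that is, when none of the trailers that were read contains a `Root` key. Here is a minimal sketch of that control flow, with plain dictionaries standing in for trailers. The `find_root` helper and the sample data are hypothetical and are not part of PDFMiner.

```python
def find_root(trailers):
    """Return the first 'Root' entry found, mimicking set_parser's for/else."""
    for trailer in trailers:
        if not trailer:
            continue  # empty trailers are skipped, as in set_parser
        if 'Root' in trailer:
            root = trailer['Root']
            break
    else:
        # Reached only if the loop never executed `break`.
        raise ValueError('No /Root object!')
    return root


# A later trailer can supply the root even if earlier ones lack it.
print(find_root([{}, {'Info': 'metadata'}, {'Root': 'catalog'}]))
```

So the exception does not necessarily mean the file has no catalog at all. It only means that no trailer found by `read_xref` exposed one.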
If the problem were with EOF, a different exception would be raised. Here is another function from the source:
def load(self, parser, debug=0):
    while 1:
        try:
            (pos, line) = parser.nextline()
            if not line.strip(): continue
        except PSEOF:
            raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
        if not line:
            raise PDFNoValidXRef('Premature eof: %r' % parser)
        if line.startswith('trailer'):
            parser.seek(pos)
            break
        f = line.strip().split(' ')
        if len(f) != 2:
            raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
        try:
            (start, nobjs) = map(long, f)
        except ValueError:
            raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
        for objid in xrange(start, start+nobjs):
            try:
                (_, line) = parser.nextline()
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            f = line.strip().split(' ')
            if len(f) != 3:
                raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
            (pos, genno, use) = f
            if use != 'n': continue
            self.offsets[objid] = (int(genno), long(pos))
    if 1 <= debug:
        print >>sys.stderr, 'xref objects:', self.offsets
    self.load_trailer(parser)
    return
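To make the format that `load` expects more concrete, here is a simplified standalone sketch of the same parsing logic. It is written for Python 3 rather than the Python 2 of the original source, and it works on a list of text lines instead of a PDFMiner parser object. The `parse_xref_section` name, the use of `ValueError`, and the sample lines are assumptions made for illustration.

```python
def parse_xref_section(lines):
    """Parse the lines that follow the 'xref' keyword, up to 'trailer'.

    Returns {object_id: (generation, byte_offset)} for in-use entries.
    """
    offsets = {}
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if line.startswith('trailer'):
            return offsets
        # Subsection header: "<first object id> <number of entries>"
        fields = line.split()
        if len(fields) != 2:
            raise ValueError('Trailer not found: line=%r' % line)
        start, count = map(int, fields)
        for objid in range(start, start + count):
            try:
                entry = next(it).split()
            except StopIteration:
                raise ValueError('Unexpected EOF - file corrupted?')
            if len(entry) != 3:
                raise ValueError('Invalid XRef format: %r' % ' '.join(entry))
            pos, genno, use = entry
            if use == 'n':  # 'n' = in use, 'f' = free
                offsets[objid] = (int(genno), int(pos))
    # Ran out of lines without ever seeing 'trailer'.
    raise ValueError('Unexpected EOF - file corrupted?')


sample = [
    '0 3',
    '0000000000 65535 f',
    '0000000017 00000 n',
    '0000000081 00000 n',
    'trailer',
]
print(parse_xref_section(sample))  # {1: (0, 17), 2: (0, 81)}
```

A file that is truncated or has a malformed table fails in this stage with an xref error, which is a different failure from the "No /Root object" message in the question.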
From Wikipedia (on the PDF specification): A PDF file consists primarily of objects, of which there are eight types:
Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
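Because of incremental updates, a file may contain several xref sections, trailers, and `%%EOF` markers. In files that use PDF 1.5 cross-reference streams, the `trailer` keyword may be absent altogether. A rough way to see which situation applies is to scan the raw bytes for the structural markers. The sketch below is only a heuristic, since a marker can also occur inside stream data. The `scan_pdf_markers` name and the toy sample are assumptions, not part of PDFMiner or pyPdf.

```python
import re


def scan_pdf_markers(data):
    """Report where structural markers occur in raw PDF bytes."""
    eofs = [m.start() for m in re.finditer(rb'%%EOF', data)]
    if eofs:
        trailing = len(data) - (eofs[-1] + len(b'%%EOF'))
    else:
        trailing = None
    return {
        'header_offset': data.find(b'%PDF-'),  # -1 if missing
        'startxref_count': len(re.findall(rb'startxref', data)),
        'trailer_count': len(re.findall(rb'\btrailer\b', data)),
        'root_count': len(re.findall(rb'/Root\b', data)),
        'eof_offsets': eofs,
        'bytes_after_last_eof': trailing,
    }


# Toy input; the byte offsets inside it are not meaningful.
sample_pdf = (
    b'%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n'
    b'xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n'
    b'trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n45\n%%EOF\n'
    b'junk appended by a server'
)
for key, value in scan_pdf_markers(sample_pdf).items():
    print(key, value)
```

For a real file, pass `open(path, 'rb').read()`. A large `bytes_after_last_eof` value would be consistent with an "EOF marker not found" error from a reader that only looks at the very end of the file, and a `trailer_count` of zero would suggest the trailer information lives in a cross-reference stream.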
I think the problem is that your "damaged pdf" has a few 'root elements' on the page.
Possible solution:
You can download the sources and add print statements at each place where xref objects are retrieved and where the parser tries to parse them. That should let you trace the full sequence of events leading up to this error.
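If editing the installed sources is inconvenient, a similar trace can be obtained with Python's `sys.settrace`, which reports every function call without modifying the library. The sketch below records the calls to functions with selected names, together with their arguments. The `record_calls` helper is hypothetical, and the two toy functions only borrow the names `read_xref` and `load` from the PDFMiner source quoted above. They are stand-ins, not the real implementations.

```python
import sys
from contextlib import contextmanager


@contextmanager
def record_calls(names, log):
    """Append (function name, arguments) to `log` for each call to a
    function whose name is in `names`, while the with-block runs."""
    def tracer(frame, event, arg):
        if event == 'call' and frame.f_code.co_name in names:
            log.append((frame.f_code.co_name, dict(frame.f_locals)))
        return None  # no line-by-line tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        yield log
    finally:
        sys.settrace(previous)  # restore whatever was active before


# Toy stand-ins that fail the way the real call chain might.
def read_xref(path):
    return load(path, debug=1)


def load(path, debug=0):
    raise ValueError('No /Root object!')


calls = []
try:
    with record_calls({'read_xref', 'load'}, calls):
        read_xref('sample.pdf')
except ValueError:
    pass  # the log survives the exception

for name, args in calls:
    print(name, args)
```

With the real library, the with-block would wrap the code that currently raises `PDFSyntaxError`, and `names` would list whichever functions from the traceback are of interest.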
P.S. I think this is some kind of bug in the product.