struct.error:解压缩需要长度为16的字符串参数 [英] struct.error: unpack requires a string argument of length 16

查看:133
本文介绍了struct.error:解压缩需要长度为16的字符串参数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在处理带有pdfminer(pdf2txt.py)的PDF 文件(2.pdf)时,以下错误:

pdf2txt.py 2.pdf 

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

虽然类似的文件(1.pdf)不会造成问题.

我找不到有关该错误的任何信息.我在pdfminer GitHub存储库上添加了问题,但仍未得到答复.有人可以向我解释为什么会这样吗?我该怎么做才能解析 2.pdf ?


更新:在BytesIO而不是StringIO遇到了类似的错误" rel ="nofollow noreferrer">直接从GitHub存储库安装pdfminer .

    $ pdf2txt.py 2.pdf 
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

解决方案

TL; DR

感谢@mkl和@hynecker提供额外的信息...这样,我可以确认这是pdfminer和您的PDF中的错误.每当pdfminer尝试获取嵌入式文件流(例如,字体定义)时,它都会在文件中找到endobj之前的最后一个.遗憾的是,并非所有PDF都严格添加了结束标签,因此pdfminer应该对此具有弹性.

此问题的快速解决方法

我创建了一个补丁-已作为请求提交到github上.参见 https://github.com/euske/pdfminer/pull/159 .. >

详细诊断

如其他答案中所述,您看到的原因是由于pdfminer正在解压缩数据时,您没有从流中获得预期的字节数.但是为什么呢?

正如您在堆栈跟踪中所看到的,pdfminer(正确地)发现它具有要处理的CID字体.然后,它继续将嵌入式字体文件处理为TrueType字体(在pdffont.py中).它尝试通过读取一组二进制表来解析关联的流(流ID 18).

这不适用于2.pdf,因为它具有文本流.您可以通过运行dumppdf -b -i 18 2.pdf看到此内容.我从这里开始:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...

那么,垃圾进来,垃圾出...这是文件或pdfminer中的错误吗?好吧,其他读者可以处理的事实使我感到怀疑.

深入研究,我发现此流与ID 17(与ToUnicode字段的cmap映射)相同.快速浏览 PDF规范会发现它们不能相同.

进一步查看代码,我发现所有流都在获取相同的数据.糟糕!这是错误.原因似乎与以下事实有关:该PDF缺少某些结束标签-如@hynecker所述.

解决方法是为每个流返回正确的数据.仅仅吞下该错误的任何其他修复方法都将导致所有流使用错误的数据,例如,错误的字体定义.

我相信随附的补丁程序可以解决您的问题,并且通常可以安全使用.

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

pdf2txt.py 2.pdf 

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

While the similar file (1.pdf) doesn't cause a problem.

I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?


Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf 
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

解决方案

TL; DR

Thanks to @mkl and @hynecker for the extra info... With that I can confirm this is a bug in pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it is picking up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag and so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch - which has been submitted as a pull request on github. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...

So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging in to the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags - as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.

这篇关于struct.error:解压缩需要长度为16的字符串参数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆