Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

Problem description

My stuff: python 2.6 64 bit (with pyPdf-1.13.win32.exe installed). Wing IDE. Windows 7 64 bit.

I get the following error:

NotImplementedError: unsupported filter /LZWDecode

when I run the following code:

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\\Users\\Homer\\Documents\\' # This is where I put my pdfs

filelist = os.listdir(path)

has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))

    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() #this is the line what done it
        does_it_have_text = re.findall(r'\w{2,}', content) 
        if does_it_have_text == []:
            does_not_have_text_list.append(pdf_name)
            print pdf_name
        else:
            has_text_list.append(pdf_name)

print does_not_have_text_list

Here's a little background. The path is full of pdfs. Some were saved from text documents using the Adobe pdf printer (at least I think that's how they did it). And some were scanned as images. I wanted to separate them and OCR the ones that are images (the non-image ones are perfect and ought not to be messed with).

I asked here a few days ago how to do that:

Batch OCR program for PDFs

The only response I got was in VB, and I only speaky the python. So I figured I would try to write an answer to my own question. My strategy (reflected in the code above) is like this. If a page is just an image, the regular expression will return an empty list. If it has text, the regular expression (which matches any run of 2 or more alphanumeric characters) will return a list populated with stuff like u'word' (in python, I think that's a unicode string).
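
The strategy described above can be tried on plain strings before any PDF is involved. This is only a minimal sketch: the helper name page_has_text and the sample strings are made up for illustration, and only the pattern itself comes from the question.

```python
import re

# Same pattern as in the question: a run of two or more word characters.
WORD_PATTERN = re.compile(r'\w{2,}')

def page_has_text(content):
    """Return True if the extracted page text contains at least one match."""
    return bool(WORD_PATTERN.findall(content))

# An image-only page typically yields an empty extraction result.
print(page_has_text(u''))             # False
# A page with real text yields matches such as u'Hello' and u'world'.
print(page_has_text(u'Hello world'))  # True
```

Note that isolated single characters do not count as text under this pattern, so stray one-letter extraction noise is ignored.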

So the code should work, and we can take the first step to finish off that other thread using open source software (separating the ocrd pdfs from the image-only ones), but I don't know how to deal with this filter error and googling wasn't helpful. So if anyone knows, it would be quite helpful.
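
One possible way to keep a batch run going while finding out which files trigger the error is to catch NotImplementedError per file. The sketch below is not pyPdf code: classify_pdfs and the extract_page_texts callable are hypothetical names, and a fake extractor stands in for the real PDF reader so the example runs on its own.

```python
import re

def classify_pdfs(pdf_names, extract_page_texts):
    """Sort file names into (has_text, no_text, unreadable).

    extract_page_texts is a hypothetical callable: given a file name it
    returns a list with the extracted text of each page, and it may raise
    NotImplementedError when a stream uses an unsupported filter.
    """
    has_text, no_text, unreadable = [], [], []
    for name in pdf_names:
        try:
            pages = extract_page_texts(name)
        except NotImplementedError as exc:
            # Record the failure and move on instead of aborting the batch.
            unreadable.append((name, str(exc)))
            continue
        if any(re.search(r'\w{2,}', page) for page in pages):
            has_text.append(name)
        else:
            no_text.append(name)
    return has_text, no_text, unreadable

def fake_extract(name):
    """Stand-in extractor (an assumption for this demo, not a real reader)."""
    samples = {'typed.pdf': [u'Quarterly report'], 'scan.pdf': [u'', u'']}
    if name not in samples:
        raise NotImplementedError('unsupported filter /LZWDecode')
    return samples[name]

print(classify_pdfs(['typed.pdf', 'scan.pdf', 'lzw.pdf'], fake_extract))
```

With pyPdf, the callable would presumably open the file and collect getPage(i).extractText() for each page up to getNumPages(), as in the loop from the question. Each file is also classified once here, rather than once per page.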

I don't really know how to use this stuff. I'm not sure what filter means in pyPdf speak. I think it's saying that it can't really read the pdf or something, even though it's ocrd. Funnily, I put one of the non-ocrd and one of the ocrd pdfs in the same folder as a python file and this worked on just the one without the for loop, so I don't know why doing them with the for loop created the filter error. I'll post the single-file code below. THX.

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

pdf = pyPdf.PdfFileReader(open('my_ocrd_file.pdf', 'rb'))

has_text_list = []
does_not_have_text_list = []

for i in range(0, pdf.getNumPages()):
    content = pdf.getPage(i).extractText()
    does_it_have_text = re.findall(r'\w{2,}', content)
    print does_it_have_text

and it prints stuff, so I don't know why I get a filter error on one and not the other. When I run this code against the other file in the directory (the one that's NOT ocrd), the output is an empty list on one line and an empty list on the next, like so:

[]
[]

So I don't guess it's a filter problem with the non-ocrd pdfs either. This is like over my head and I need some help here.

Google search found this, but I don't know what to make of it:

http://vaitls.com/treas/pdf/pyPdf/filters.py

Recommended answer

Replace pyPdf's filters.py with http://vaitls.com/treas/pdf/pyPdf/filters.py in your pyPdf source folder. That worked for me.
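
For background on what the missing filter does: in a PDF, a filter names the encoding applied to a stream's data, and /LZWDecode means the data is LZW-compressed. The sketch below is not the code from the linked file. It is a rough Python 3 illustration of an LZW decoder following the PDF conventions as I understand them (codes of 9 to 12 bits packed most significant bit first, 256 as the clear-table code, 257 as end of data), and it assumes the default early-change behaviour. The function name lzw_decode is made up, and the sketch has only been checked against the short sample below, not against real files.

```python
CLEAR_TABLE = 256   # code that resets the string table
END_OF_DATA = 257   # code that ends the stream

def lzw_decode(data):
    """Decode LZW-compressed bytes (sketch; assumes EarlyChange = 1)."""
    def fresh_table():
        # 256 single-byte entries plus two placeholders for codes 256 and 257.
        return [bytes([i]) for i in range(256)] + [b'', b'']

    table, width, prev = fresh_table(), 9, None
    out = bytearray()
    bitbuf, bitcount = 0, 0
    for byte in data:
        bitbuf = (bitbuf << 8) | byte
        bitcount += 8
        while bitcount >= width:
            bitcount -= width
            code = (bitbuf >> bitcount) & ((1 << width) - 1)
            bitbuf &= (1 << bitcount) - 1   # drop the bits just consumed
            if code == CLEAR_TABLE:
                table, width, prev = fresh_table(), 9, None
                continue
            if code == END_OF_DATA:
                return bytes(out)
            if code < len(table):
                entry = table[code]
            elif code == len(table) and prev is not None:
                entry = prev + prev[:1]     # code refers to the entry being built
            else:
                raise ValueError('invalid LZW code %d' % code)
            if prev is not None:
                table.append(prev + entry[:1])
            out += entry
            prev = entry
            # Early change: widen the code one entry before the table fills.
            if len(table) >= (1 << width) - 1 and width < 12:
                width += 1
    return bytes(out)

sample = bytes([0x80, 0x0B, 0x60, 0x50, 0x22, 0x0C, 0x0C, 0x85, 0x01])
print(lzw_decode(sample))  # b'-----A---B'
```

The sample bytes encode the code sequence 256, 45, 258, 258, 65, 259, 66, 257, which expands to five hyphens, "A", three hyphens, and "B". A library that lacks such a decoder has no way to read the page content stream, which is why text extraction stops with NotImplementedError.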
