PDFminer: PDFTextExtractionNotAllowed Error
Question
I'm trying to extract text from PDFs I've scraped off the internet, but when I attempt to download them I get the error:
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed <cStringIO.StringO object at 0x7f79137a1ab0>
I've checked Stack Overflow, and someone else who had this error found that their PDFs were secured with a password. However, I'm able to open the PDFs in Preview on my Mac.
Someone mentioned that Preview may open secured PDFs anyway, so I also opened the files in Adobe Acrobat Reader and was still able to access them.
Here's an example from the site I'm downloading PDFs from: http://www.sophia-project.org/uploads/1/3/9/5/13955288/aristotle_firstprinciples.pdf
I discovered that if I open a PDF manually and re-export it as a PDF to the same filepath (basically replacing the original with a 'new' file), I am then able to extract text from it. I'm guessing it has something to do with downloading them from the site. I'm simply using urllib to download the PDFs as follows:
if not os.path.isfile(filepath):
    print '\nDownloading pdf'
    urllib.urlretrieve(link, filepath)
else:
    print '\nFile {} already exists!'.format(title)
I also tried rewriting the file to a new filepath, but that still resulted in the same error:
if not os.path.isfile(filepath):
    print '\nDownloading pdf'
    urllib.urlretrieve(link, filepath)
    with open(filepath) as f:
        new_filepath = re.split(r'\.', filepath)[0] + '_.pdf'
        new_f = file(new_filepath, 'w')
        new_f.write(f.read())
        new_f.close()
    os.remove(filepath)
    filepath = new_filepath
else:
    print '\nFile {} already exists!'.format(title)
Lastly, here is the function I'm using to extract the text:
def convert(fname, pages=None):
    '''
    Get text from pdf
    '''
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    try:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    except PDFTextExtractionNotAllowed:
        print 'This pdf won\'t allow text extraction!'
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text
Is there any way I can solve this programmatically rather than manually re-exporting the files in Preview?
Recommended answer
More recent versions of PDFMiner have a check_extractable parameter. You can pass it to the get_pages method:
fp = open(filename, 'rb')
PDFPage.get_pages(fp, check_extractable=False)
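For context, here is a rough sketch of how that flag might slot into an extraction function like the one in the question. It assumes Python 3 and the pdfminer.six fork rather than the Python 2 PDFMiner from the traceback; the function name extract_text and its parameters are made up for illustration. With check_extractable=False the "no extraction" permission flag in the PDF is not enforced, so no PDFTextExtractionNotAllowed is raised.

```python
from io import StringIO


def extract_text(path, pages=None, check_extractable=False):
    """Return the text of the PDF at `path` (hypothetical helper).

    With check_extractable=False, get_pages does not raise
    PDFTextExtractionNotAllowed for PDFs flagged as non-extractable.
    """
    # Assumption: pdfminer.six is installed. The imports are local so that
    # merely importing this module does not require the third-party package.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    # An empty set means "all pages", matching the question's convert().
    pagenums = set(pages) if pages else set()

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # PDFs must be read in binary mode.
    with open(path, "rb") as infile:
        for page in PDFPage.get_pages(
            infile, pagenums, check_extractable=check_extractable
        ):
            interpreter.process_page(page)

    converter.close()
    return output.getvalue()
```

Calling extract_text('aristotle_firstprinciples.pdf') on a downloaded copy would then return the text without the manual re-export step, assuming the file is only flagged as non-extractable and does not need a password to open.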