PDFMiner - Iterating through pages and converting them to text


Question

So I'm trying to get a specific bit of text out of some PDFs, and I'm using Python with PDFMiner but having some trouble due to the API changes to it that happened in November 2013. Basically, to get the part of text I want out of the PDF, I currently have to convert the entire file to text, and then use string functions to get the part I want. What I want to do is loop through each page of the PDF and convert each one to text, one by one. Then once I've found the part I want, I'll just stop it from reading that PDF.

I'll post the code that's sitting in my text editor atm, but it's not the working version, it's more the half-way-to-the-efficient-solution version :P

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import LTChar, TextConverter
from pdfminer.layout import LAParams
from subprocess import call
from cStringIO import StringIO
import re
import sys
import os

argNum = len(sys.argv)
pdfLoc = str(sys.argv[1]) #CLI arguments

def convert_pdf_to_txt(path): #converts pdf to raw text (not my function)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue() #named text so the built-in str is not shadowed
    retstr.close()
    return text

if (pdfLoc[-4:] == ".pdf"):
    contents = ""
    try: # Get the outlines (contents) of the document
        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        outlines = document.get_outlines()
        for (level,title,dest,a,se) in outlines:
            title = re.sub(r".*\s", "", title) #get raw titles, stripped of formatting
            contents += title + "\n"
    except: #if pdfMiner can't get contents then manually get contents from text conversion
        #contents = convert_pdf_to_txt(pdfLoc)
        #startToCpos = contents.find("TABLE OF CONTENTS")
        #endToCpos = contents.rfind(". . .")
        #contents = contents[startToCpos:endToCpos+8]

        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        pages = PDFPage(document, 3, {'Resources':'thing', 'MediaBox':'Thing'}) #God knows what's going on here
        for pageNumber, page in enumerate(pages.get_pages(PDFDocument, fp)): #The hell is the first argument?
            if pageNumber == 42:
                print "Hello"

        #for line in s:
        #   print line
        #   if (re.search("(\.\s){2,}", line) and not re.search("NOTES|SCOPE", line)):
        #       line = re.sub("(\.\s){2,}", "", line)
        #       line = re.sub("(\s?)*[0-9]*\n", "\n", line)
        #       line = re.sub("^\s", "", line)
        #       print line,


        #contents = contents.lower()
        #contents = re.sub("“", "\"", contents)
        #contents = re.sub("”", "\"", contents)
        #contents = re.sub("fi", "f", contents)
        #contents = re.sub(r"(TABLE OF CONTENTS|LIST OF TABLES|SCOPE|REFERENCED DOCUMENTS|Identification|System (o|O)verview|Document (o|O)verview|Title|Page|Table|Tab)(\n)?|\.\s?|Section|[0-9]", "", contents)
        #contents = re.sub(r"This document contains proprietary information and may not be reproduced in any form whatsoever, nor may be used by or its contents divulged to third\nparties without written permission from the ownerAll rights reservedNumber:  STP SMEDate: -Jul-Issue: A  of CMC STPNHIndustriesCLASSIFICATION\nNATO UNCLASSIFIED                  AGUSTAEUROCOPTEREUROCOPTER DEUTSCHLAND                 FOKKER", "", contents)
        #contents = re.sub(r"(\r?\n){2,}", "", contents)
        #contents = contents.lstrip()
        #contents = contents.rstrip()
    #print contents
else:
    print "Not a valid PDF file"

This is the old way of doing it (Or at least an idea of how the old way did it, the thread wasn't very useful to me tbh). But now I have to use PDFPage.get_pages instead of PDFDocument.get_pages and the methods and their arguments are completely different.

Currently, I'm trying to figure out what on earth the 'Klass' variable is that I pass to the get_pages method of PDFPage.

If anybody could shed some light on this part of the API or even provide a working example I'd very much appreciate it.
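
A note on `klass`, with a sketch of the page-by-page loop described above: in PDFMiner's source, `get_pages` is declared as a classmethod on `PDFPage`, so `klass` is simply the class itself (the parameter more often spelled `cls`), and Python supplies it automatically when the method is called as `PDFPage.get_pages(fp)`. The code below is a minimal sketch rather than a definitive implementation. It assumes Python 3 with the pdfminer.six fork installed (under the original Python 2 package the buffer would be a `cStringIO.StringIO` instead), and the helper names `iter_page_texts` and `first_page_containing` are made up for illustration.

```python
import io


def iter_page_texts(path, password=""):
    """Yield the text of each page of the PDF at `path`, one page at a time."""
    # Imported lazily so that defining this function does not require pdfminer.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    with open(path, "rb") as fp:
        # get_pages is a classmethod: call it on the PDFPage class itself.
        # Python fills in the first parameter (klass/cls); only the file is passed.
        for page in PDFPage.get_pages(fp, password=password):
            buf = io.StringIO()  # fresh buffer, so each yield holds one page only
            device = TextConverter(rsrcmgr, buf, laparams=LAParams())
            PDFPageInterpreter(rsrcmgr, device).process_page(page)
            device.close()
            yield buf.getvalue()


def first_page_containing(page_texts, needle):
    """Return (page_number, text) for the first page containing `needle`, else None."""
    for number, text in enumerate(page_texts, start=1):
        if needle in text:
            return number, text  # returning here stops reading further pages
    return None
```

A call such as `first_page_containing(iter_page_texts("manual.pdf"), "TABLE OF CONTENTS")` (the file name is hypothetical) returns as soon as a matching page is found. Because `iter_page_texts` is a generator, the pages after the match are never interpreted.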

Recommended answer

Try using PyPDF2. It is a lot simpler to use and not as unnecessarily feature rich as PDFMiner (which is good in your case). Here is what you wanted and it's super simple to implement.

from PyPDF2 import PdfFileReader

PDF = PdfFileReader(open(pdf_fp, 'rb'))  # pdf_fp is the path to the PDF file

if PDF.isEncrypted:
    decrypt = PDF.decrypt('')
    if decrypt == 0:
        print "Password Protected PDF: " + pdf_fp
        raise Exception("Nope")
    elif decrypt == 1 or decrypt == 2:
        print "Successfully Decrypted PDF"

for page in PDF.pages:
    text = page.extractText()  # unicode string with the contents of this page
    print text
    # Search the text with string methods or a regex here and, once you
    # have found what you want, stop reading further pages:
    # if some_condition:
    #     break
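
The names used in the answer (`PdfFileReader`, `isEncrypted`, `extractText`) belong to older PyPDF2 releases. PyPDF2 3.0 dropped them in favour of `PdfReader`, `is_encrypted` and `extract_text()`, and the project now continues under the name `pypdf`. Below is a hedged Python 3 sketch of the same loop with the newer names, assuming `pypdf` is installed. The function name `find_page` and its parameters are illustrative and not part of the library.

```python
def find_page(path, needle, password=""):
    """Return (page_number, text) for the first page containing `needle`, else None."""
    # Imported inside the function so that defining it does not require pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() reports 0 when the password did not unlock the file.
        if reader.decrypt(password) == 0:
            raise ValueError("Password protected PDF: %s" % path)

    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needle in text:
            return number, text  # stop here; later pages are not extracted
    return None
```

As in the answer, an empty password is tried first, which is enough for files that are encrypted but carry no user password.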
