Extracting text from a PDF with PDFMiner gives multiple copies

Problem description

I am trying to extract text from a PDF file using PDFMiner, with the code from the thread "Extracting text from a PDF file using PDFMiner in python?". I didn't change anything in the code except path/to/pdf. Surprisingly, the code returns several copies of the same document, and I get the same result with other PDF files. Do I need to pass other arguments, or am I missing something? Any help is highly appreciated. Just in case, here is the code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,    password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

        str = retstr.getvalue()
        fstr += str

    fp.close()
    device.close()
    retstr.close()
    return fstr

print convert_pdf_to_txt("test.pdf")

Recommended answer

My answer was a bit incorrect in the thread that you are referencing. I found the bug and forgot to update my answer.

Because the documentation is pretty sparse with pdfminer, I'm not able to fully explain why this works the way it does. Hopefully someone who knows the pdfminer library a bit better can give us some insight.

All I know is that you have to do text = retstr.getvalue() outside of the for loop. I can only assume that retstr is being updated as if we were doing final_text += text inside the for loop, so once it's all finished we just have to do text = retstr.getvalue() to get the text from all the pages.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after every page has been processed.
    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt("test.pdf")
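
The duplication can be reproduced without pdfminer at all. The sketch below is only an illustration, written for Python 3 with the standard-library io.StringIO. It assumes, as the answer suggests, that the converter appends each page's text to the same buffer. The pages list and both function names are made up for this example and are not part of pdfminer.

```python
from io import StringIO

# Stand-ins for the text the converter would write for each page (hypothetical data).
pages = ["page one\n", "page two\n", "page three\n"]


def read_inside_loop(pages):
    """Mimics the question's code: getvalue() is called on every iteration."""
    buf = StringIO()
    collected = ""
    for page_text in pages:
        buf.write(page_text)         # the buffer keeps everything written so far
        collected += buf.getvalue()  # so earlier pages are appended again each time
    return collected


def read_after_loop(pages):
    """Mimics the answer's code: getvalue() is called once, after the loop."""
    buf = StringIO()
    for page_text in pages:
        buf.write(page_text)
    return buf.getvalue()


print(read_inside_loop(pages).count("page one"))  # 3
print(read_after_loop(pages).count("page one"))   # 1
```

With three pages, the first page shows up three times in the in-loop version and once in the after-loop version, which matches the "several copies" symptom described in the question.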

Hope this helps!
