How do I use pdfminer as a library?


Problem description

I am trying to extract text from a PDF using pdfminer. I can extract the data to a .txt file successfully with the pdfminer command-line tool pdf2txt.py. Currently I do this and then clean up the .txt file with a Python script. I would like to incorporate the PDF extraction into the script and save myself a step.

I thought I was on to something when I found this link, but none of the solutions there worked for me. Perhaps the function listed there needs to be updated again, because I am using a newer version of pdfminer.

I also tried the function shown here, but that did not work either.

Another approach I tried was to call pdf2txt.py from within my script using os.system. This was also unsuccessful.

I am using Python 2.7.1 and pdfminer version 20110227.
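For the os.system approach mentioned above, one alternative is the standard subprocess module, which takes the arguments as a list and can capture the tool's output directly. The following is only a sketch under assumptions: the function names and the default script location "pdf2txt.py" are hypothetical, and it relies on pdf2txt.py writing the extracted text to standard output when no output file is given.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, script="pdf2txt.py"):
    """Build the argument list for running pdf2txt.py on one PDF.

    `script` is an assumed location; pass the full path if pdf2txt.py
    is not in the current working directory.
    """
    # Running the script through the current interpreter avoids relying
    # on pdf2txt.py being executable or on the PATH.
    return [sys.executable, script, pdf_path]


def pdf_to_text_via_cli(pdf_path, script="pdf2txt.py"):
    """Run pdf2txt.py and return whatever it writes to standard output."""
    # A list of arguments sidesteps the shell quoting problems that
    # os.system has with spaces in file names. check_output raises
    # CalledProcessError if the script exits with a non-zero status.
    return subprocess.check_output(build_pdf2txt_command(pdf_path, script))
```

Calling pdf_to_text_via_cli("report.pdf", script="/path/to/pdf2txt.py") would then hand the raw text straight to the cleanup code without an intermediate .txt file; both paths here are placeholders.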

Recommended answer

Here is a cleaned-up version I finally produced that worked for me. Given a filename, it simply returns the text of the PDF as a string. I hope this saves someone time.

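from cStringIO import StringIO  # Python 2 only

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def convert_pdf(path):
    """Return the text of the PDF at `path` as a UTF-8 encoded string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())

    # open() replaces the deprecated file() builtin; the with block
    # closes the PDF even if process_pdf raises.
    with open(path, 'rb') as fp:
        process_pdf(rsrcmgr, device, fp)
    device.close()

    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text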

This solution was valid until the API changes in November 2013.
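For readers on a release made after those changes, a possible equivalent is sketched below. It assumes the pdfminer.six fork (which runs on Python 3) is installed, rather than the original pdfminer 20110227 discussed in the question; extract_text comes from that fork's pdfminer.high_level module, and the wrapper name convert_pdf is kept only for continuity with the code above.

```python
def convert_pdf(path):
    """Return the text of the PDF at `path` as a str.

    Assumes the pdfminer.six fork is installed; this is not the API of
    the pdfminer 20110227 release.
    """
    # Imported inside the function so that defining it does not require
    # pdfminer.six to be present; the ImportError surfaces on first call.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

The older function returned an encoded byte string under Python 2, whereas this variant returns a Unicode str, so any cleanup code written for the original may need adjusting.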
