How do I use pdfminer as a library

Problem description

I am trying to get text data from a PDF using pdfminer. I can extract this data to a .txt file successfully with the pdfminer command-line tool pdf2txt.py. Currently I do that and then use a Python script to clean up the .txt file. I would like to incorporate the PDF extraction into the script and save myself a step.

I thought I was on to something when I found this link, but none of the solutions there worked for me. Perhaps the function listed there needs to be updated again, because I am using a newer version of pdfminer.

I also tried the function shown here, but it did not work either.

Another approach I tried was to call the pdf2txt.py script from within my script using os.system. This was also unsuccessful.
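For reference, os.system returns the command's exit status rather than its output, so it cannot hand the extracted text back to the calling script. Below is a minimal sketch of capturing a command's standard output with subprocess.check_output instead. The helper name run_and_capture and the file name input.pdf are hypothetical, and the commented-out call assumes pdf2txt.py is on the PATH and prints the text to standard output when no output file is given. Whether this would have resolved the failure described above is not known.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command (given as a list of arguments) and return its stdout as text."""
    # check_output raises CalledProcessError if the command exits with a non-zero status.
    raw = subprocess.check_output(command)
    # The output arrives as bytes; decode it, assuming the tool emits UTF-8.
    return raw.decode('utf-8')


if __name__ == '__main__':
    # Stand-in command so the sketch runs without pdfminer installed.
    print(run_and_capture([sys.executable, '-c', 'print("hello")']))
    # Hypothetical real use (assumes pdf2txt.py is on the PATH):
    # text = run_and_capture(['pdf2txt.py', 'input.pdf'])
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.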

I am using Python version 2.7.1 and pdfminer version 20110227.

Recommended answer

Here is a cleaned-up version that I finally got working. Given a PDF's filename, it simply returns the text in that PDF as a string. I hope this saves someone time.

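from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):
    # Return the text of the PDF at `path` as a UTF-8 encoded string (Python 2).
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()    # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    # Feed every page of the PDF through the text converter.
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    # Read the collected text before closing the buffer.
    text = retstr.getvalue()
    retstr.close()
    return text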

This solution was valid until the pdfminer API changes in November 2013.
