How do I use pdfminer as a library?
Problem description
I am trying to get text data from a PDF using pdfminer. I can extract this data to a .txt file successfully with the pdfminer command-line tool pdf2txt.py. I currently do this and then use a Python script to clean up the .txt file. I would like to incorporate the PDF extraction into the script and save myself a step.
I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.
Another approach I tried was to call the pdf2txt.py script from within my script using os.system. This was also unsuccessful.
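For readers who want to stay with the command-line route, here is a minimal sketch that uses subprocess.check_output instead of os.system, so the tool's output comes back as a string and a failed run raises an error. The helper names run_and_capture and pdf_to_text_via_cli are hypothetical, and the sketch assumes that pdf2txt.py is reachable at the given path and prints the extracted text to standard output when no output file is requested.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command (a list of arguments) and return its stdout as text."""
    # check_output raises CalledProcessError on a non-zero exit status,
    # so a failed run is not silently ignored as it can be with os.system.
    output = subprocess.check_output(command)
    return output.decode('utf-8')


def pdf_to_text_via_cli(pdf_path, script='pdf2txt.py'):
    """Hypothetical wrapper that runs pdf2txt.py on one PDF file."""
    # Assumption: `script` is a valid path to pdf2txt.py and the tool
    # writes the extracted text to stdout when no output file is given.
    return run_and_capture([sys.executable, script, pdf_path])
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.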
I am using Python version 2.7.1 and pdfminer version 20110227.
Recommended answer
Here is a cleaned-up version that I finally got working. Given a filename, the following function simply returns the text of the PDF as a string. I hope this saves someone time.
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def convert_pdf(path):
    """Return the text content of the PDF at `path` as a UTF-8 string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Open the PDF in binary mode; open() replaces the legacy file() builtin.
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    # Use a name other than `str` so the builtin is not shadowed.
    text = retstr.getvalue()
    retstr.close()
    return text
This solution was valid until the pdfminer API changes in November 2013.
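For releases after that API change, where process_pdf is no longer available, a rough equivalent might look like the sketch below. It is not part of the original answer. It assumes Python 3 with a recent pdfminer release such as pdfminer.six, in which pages are iterated with PDFPage.get_pages and rendered through a PDFPageInterpreter.

```python
from io import StringIO


def convert_pdf(path):
    """Return the text of the PDF at `path` as one string."""
    # Assumption: a post-2013 pdfminer (for example pdfminer.six) is installed.
    # Importing inside the function means a missing or older install
    # surfaces as an ImportError when the function is called.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(path, 'rb') as fp:
            # Feed each page to the interpreter, which writes text to retstr.
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
        return retstr.getvalue()
    finally:
        # Release the converter and the buffer even if parsing fails.
        device.close()
        retstr.close()
```

The function keeps the same name and signature as the original so that calling code would not need to change.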