Read a PDF file horizontally with pdfminer
Question
I would like to extract the text of a PDF with pdfminer (version 20140328).
This is the code I use to extract the PDF:
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
import urllib2

def pdf_to_string(data):
    fp = StringIO(data)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    data = retstr.getvalue()
    return data

pdf_url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/140836.pdf"
file_object = urllib2.urlopen(urllib2.Request(pdf_url)).read()
string = pdf_to_string(file_object)
In the PDF, each person's name sits on the same row as their position (the original post showed a screenshot of the page).
The problem is that pdfminer doesn't read the page horizontally (person, then position) but in columns (all the persons, then all their respective positions):
Belgium:
Mr Koen GEENS
Bulgaria:
Mr Petar CHOBANOV
Czech Republic:
Mr Radek URBAN
Minister for Finance, with responsibility for the Civil
Service
Minister for Finance
Deputy Minister for Finance
How can I make pdfminer read the text horizontally?
Recommended answer
I have found a working solution with pdftotext:
import tempfile, subprocess

def pdf_to_string(file_object):
    pdfData = file_object.read()
    # Write the PDF bytes to a temporary file that pdftotext can open by name.
    tf = tempfile.NamedTemporaryFile()
    tf.write(pdfData)
    tf.seek(0)
    outputTf = tempfile.NamedTemporaryFile()
    if len(pdfData) > 0:
        # -layout keeps the physical layout of the page, so each row stays together.
        out, err = subprocess.Popen(["pdftotext", "-layout", tf.name, outputTf.name]).communicate()
        return outputTf.read()
    else:
        return None

pdf_file = "files/2014_1.pdf"
file_object = file(pdf_file, 'rb')
print pdf_to_string(file_object)
This produces the right output: each person's name followed by their position :).
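For readers on Python 3, the same idea can be sketched without temporary output files. This is only a minimal sketch, not the answer's original code: the helper name build_pdftotext_command is hypothetical, it assumes the pdftotext command-line tool is installed and on the PATH, and it assumes pdftotext's default UTF-8 output. Passing "-" as the output file asks pdftotext to write the text to stdout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list for pdftotext in layout mode (hypothetical helper).

    "-layout" keeps the physical layout of the page, and "-" as the
    output file sends the extracted text to stdout.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_string(pdf_path):
    """Return the text of pdf_path, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: pdftotext was left at its default UTF-8 output encoding.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdf_to_string("files/2014_1.pdf") would then return the text as a string, which can be printed or split into lines as needed.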