Parse a PDF using Python


Problem description

I have a PDF file. It consists of four columns, and none of the pages have grid lines. It contains students' marks.

I would like to run some analysis on this distribution (histograms, line graphs, etc.).

I want to convert this PDF file into a spreadsheet or an HTML file, which I can then parse very easily.

The link to the PDF is:

Pdf

This is a public document and is openly available on this domain to anyone.

Note: I know this can be done by exporting the file to text from Adobe Reader and then importing it into LibreOffice Calc or Excel, but I want to do it with a Python script.

Please help me with this issue. Specs: Windows 7, Python 2.7.

Recommended answer

Use PyPDF2:

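from PyPDF2 import PdfFileReader

# Open the PDF in binary mode and extract the text of the first page.
with open('CT1-All.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    # extractText() returns a single string; split it into lines.
    contents = reader.getPage(0).extractText().split('\n')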
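The snippet above uses the older PyPDF2 names (`PdfFileReader`, `getPage`, `extractText`), which were removed in PyPDF2 3.0. As a rough sketch only, assuming Python 3 and the successor package `pypdf` is installed, the same step might look like the following. The function name `extract_page_lines` and its `page_index` parameter are hypothetical and not part of the original answer.

```python
def extract_page_lines(path, page_index=0):
    """Return the text of one page of the PDF at `path`, split into lines."""
    # Imported here so that loading this module does not itself require pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # pages is a zero-indexed sequence; extract_text() returns one string.
    text = reader.pages[page_index].extract_text()
    return text.split("\n")
```

With this sketch, `contents = extract_page_lines('CT1-All.pdf')` plays the role of `contents` in the original snippet, although the exact line breaks in the extracted text may differ between library versions.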

When you print contents, it will look like this (I have trimmed it here):

[u'Serial NoRoll NoNameCT1 Marks (50)111MA20026KARADI KALYANI212AR10029MUKESH K
MAR5', u'312MI31004DEEPAK KUMAR7', u'413AE10008FADKE PRASAD DIPAK27', u'513AE10
22RAHUL DUHAN37', u'613AE30005HIMANSHU PRABHAT26.5', u'713AE30019VISHAL KUMAR39
, u'813AG10014HEMANT17', u'913AG10028SHRESTH KR KRISHNA37.51013AG30009HITESH ME
RA33.5', u'1113AG30023RACHIT MADHUKAR40.5', u'1213AR10002ACHARY SUDHEER11', u'1
13AR10004AMAN ASHISH20.5', u'1413AR10008ANKUR44', u'1513AR10010CHUKKA SHALEM RA
U11.5', u'1613AR10012DIKKALA VIJAYA RAGHAVA20.5', u'1713AR10014HRISHABH AMRODIA
1', u'1813AR10016JAPNEET SINGH CHAHAL19.5', u'1913AR10018K VIGNESH42.5', u'2013
R10020KAARTIKEY DWIVEDI49.5', u'2113AR10024LAKSHMISRI KEERTI MANNEY49', u'2213A
10026MAJJI DINESH9.5', u'2313AR10028MOUNIKA BHUKYA17.5', u'2413AR10030PARAS PRA
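
Because the extracted text runs the four columns together, each entry still has to be split into fields before it can go into a spreadsheet. The following Python 3 sketch shows one possible way to do that with a regular expression. It assumes, based only on the sample output above, that a roll number is two digits, two capital letters, and five digits, and that names contain only capital letters, spaces, and dots. The names `RECORD`, `parse_record`, and `to_csv` are hypothetical. Entries in which several records are fused into one string (as happens in some of the output above) do not match the pattern and are skipped here; they would need extra handling.

```python
import csv
import io
import re

# Assumed layout (inferred from the sample output, not confirmed by the source):
# serial number, roll number (2 digits + 2 letters + 5 digits), name, optional mark.
RECORD = re.compile(
    r"^(?P<serial>\d+)"
    r"(?P<roll>\d{2}[A-Z]{2}\d{5})"
    r"(?P<name>[A-Z .]+?)"
    r"(?P<marks>\d+(?:\.\d+)?)?$"
)


def parse_record(text):
    """Return (serial, roll, name, marks) for one clean record, or None."""
    match = RECORD.match(text.strip())
    if match is None:
        return None
    marks = match.group("marks")
    return (
        int(match.group("serial")),
        match.group("roll"),
        match.group("name").strip(),
        float(marks) if marks is not None else None,
    )


def to_csv(lines):
    """Write every line that parses as a single record to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["serial", "roll_no", "name", "marks"])
    for line in lines:
        row = parse_record(line)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()


# Three entries copied from the output shown above.
sample = [
    "312MI31004DEEPAK KUMAR7",
    "613AE30005HIMANSHU PRABHAT26.5",
    "1113AG30023RACHIT MADHUKAR40.5",
]
print(to_csv(sample))
```

The resulting CSV text can be saved to a file and opened in LibreOffice Calc or Excel, or loaded by a plotting script for the histograms mentioned in the question.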
