Python to read PDF files


Question

I have found many posts proposing solutions for reading PDFs. I want to read a PDF file word by word and do some processing on it. People suggest pdfMiner, which converts the entire PDF file into a text file, but what I want is to read the PDF word by word. Can anyone suggest a library that does this?

Recommended answer

Possibly the fastest way to do this is to first convert your PDF into a text file using pdftotext (pdfMiner's site states that pdfMiner is 20 times slower than pdftotext), and afterwards parse the text file as usual.
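As a rough illustration of that two-step approach, here is a minimal sketch, not a definitive implementation. It assumes the pdftotext command-line tool (shipped with Xpdf or Poppler) is installed and on your PATH. The helper names pdf_to_text and iter_words are made up for this example, and "words" are taken to be whitespace-separated tokens, which may not match what your processing needs.

```python
import io
import subprocess
from typing import Iterator, TextIO


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text as a string.

    Assumption: pdftotext is installed and on PATH. Passing "-" as the
    output file name makes pdftotext write the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def iter_words(stream: TextIO) -> Iterator[str]:
    """Yield whitespace-separated words from a text stream, line by line."""
    for line in stream:
        for word in line.split():
            yield word
```

With those helpers, reading a file word by word could look like `for word in iter_words(io.StringIO(pdf_to_text("input.pdf"))): ...`, where input.pdf is a placeholder path. If you have already written the text to disk with pdftotext, you can pass an open file object to iter_words instead.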

Also, when you said "I want to read a pdf file word by word and do some processing on it", you didn't specify whether you want to do processing based on the words in a PDF file or actually modify the PDF file itself. If it's the second case, then you've got an entirely different problem on your hands.
