Python text extraction does not work on some PDFs

Question

I am trying to read a PDF from a URL. I followed several Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

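from io import BytesIO
from urllib.request import Request, urlopen

import PyPDF2  # uses the legacy PdfFileReader API (PyPDF2 before 3.0)

url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
data = urlopen(Request(url)).read()
file_input = BytesIO(data)  # PDF content is binary, so wrap the bytes in BytesIO
pdf = PyPDF2.PdfFileReader(file_input)

print(pdf.getNumPages())
print(pdf.getDocumentInfo())
print(pdf.getPage(1).extractText())  # pages are zero-indexed: this is the second page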

I am able to extract text successfully for the first link. But if I run the same program on the second PDF, I do not get any text. The page count and document info do show up.

I tried extracting text with pdfminer from the terminal and was able to extract text from the second PDF.

Any idea what is wrong with the PDF, or is this a limitation of the libraries I am using?

Recommended answer

If you read the comments in the pyPDF documentation, you'll see it stated right there that this functionality will not work well for some PDF files; in other words, you're looking at a limitation of the library.
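Since the question reports that pdfminer did extract text from the second file, one possible workaround is to try several extractors in turn and keep the first one that returns non-blank text. The following is only a sketch under assumptions: the helper names (`extract_with_pypdf2`, `extract_with_pdfminer`, `extract_text_with_fallback`) are hypothetical, the PyPDF2 calls assume the legacy pre-3.0 API used in the question, and the pdfminer call assumes the pdfminer.six package is installed.

```python
from io import BytesIO


def extract_with_pypdf2(data):
    """Extract text from all pages with PyPDF2's legacy (pre-3.0) API."""
    import PyPDF2  # imported lazily so the fallback helper works without it

    reader = PyPDF2.PdfFileReader(BytesIO(data))
    return "".join(
        reader.getPage(i).extractText() for i in range(reader.getNumPages())
    )


def extract_with_pdfminer(data):
    """Extract text with pdfminer.six's high-level helper."""
    from pdfminer.high_level import extract_text  # assumes pdfminer.six

    return extract_text(BytesIO(data))


def extract_text_with_fallback(data, extractors):
    """Return (name, text) from the first extractor that yields non-blank text.

    `extractors` is a list of (name, callable) pairs tried in order. An
    extractor that raises or returns only whitespace is skipped.
    """
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception:
            continue  # this library could not handle the file; try the next one
        if text and text.strip():
            return name, text
    return None, ""


# Hypothetical usage, with pdf_bytes holding the downloaded file contents:
# name, text = extract_text_with_fallback(
#     pdf_bytes,
#     [("PyPDF2", extract_with_pypdf2), ("pdfminer", extract_with_pdfminer)],
# )
```

Passing the extractors in as a list keeps the fallback logic independent of any one library, so the order can be changed or another tool added without touching the loop.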

Looking at the two PDF files, I can't see anything wrong with the files themselves. But...

The first file contains fully embedded fonts.
The second file contains subsetted fonts.

This means that the second file is more difficult to extract text from, and the library probably doesn't support that properly. Just for reference, I did a text extraction with callas pdfToolbox (caution: I'm affiliated with this tool), which uses the Acrobat text extraction, and the text is properly extracted for both files (confirming that the PDF files are not the problem).
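As a rough diagnostic for the font difference described above, subsetted fonts can usually be recognized by name: the PDF naming convention prefixes a subset font's /BaseFont with a tag of six uppercase letters and a plus sign, for example /ABCDEF+TimesNewRoman. The sketch below only classifies font names that have already been read from a file; the function names and the sample font names are hypothetical, and a subset tag is a hint about the file, not proof of why a given library fails.

```python
import re

# Subset fonts carry a six-uppercase-letter tag followed by "+",
# optionally preceded by the "/" of a PDF name object.
_SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+")


def is_subset_font_name(base_font):
    """Return True if a /BaseFont name starts with a subset tag."""
    return bool(_SUBSET_TAG.match(str(base_font)))


def split_fonts(base_fonts):
    """Group font names into those with a subset tag and all others."""
    groups = {"subset": [], "other": []}
    for name in base_fonts:
        key = "subset" if is_subset_font_name(name) else "other"
        groups[key].append(name)
    return groups


# Hypothetical font names, as they might appear in a page's font resources
print(split_fonts(["/ABCDEF+TimesNewRoman", "/Helvetica"]))
```

The font names themselves would have to come from each page's /Resources /Font dictionary, read with whichever PDF library is at hand.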
