Python text extraction does not work on some PDFs

Views: 221 Published: 2020/5/25 4:29:40 Tags: python pdf web-scraping pypdf pdfminer


Problem description

I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

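# Python 2 code (urllib2, StringIO, print statements)
from urllib2 import Request, urlopen
from StringIO import StringIO
import PyPDF2

url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()  # download the PDF into memory
fileInput = StringIO(f)           # wrap the downloaded bytes in a file-like object
pdf = PyPDF2.PdfFileReader(fileInput)

print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()  # page indices are 0-based, so this is the second page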

I can successfully extract text from the first link. But if I run the same program on the second PDF, I do not get any text. The page count and document info do show up.

I tried extracting text with pdfminer from the terminal and was able to extract text from the second PDF.
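The comparison described above (PyPDF2 returns nothing, pdfminer returns text) could be scripted as a simple fallback. This is only a minimal sketch under assumptions that are not part of the original code: it targets Python 3, a recent PyPDF2 (2.x or later, where `PdfReader` and `extract_text()` replace `PdfFileReader` and `extractText()`), and the pdfminer.six package. The helper names `extract_with_pypdf2`, `extract_with_pdfminer`, and `first_nonempty_text` are hypothetical.

```python
from io import BytesIO


def extract_with_pypdf2(data):
    """Extract text from PDF bytes with PyPDF2 (assumes PyPDF2 >= 2.0)."""
    # Imported lazily so this module still loads if PyPDF2 is not installed.
    from PyPDF2 import PdfReader
    reader = PdfReader(BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfminer(data):
    """Extract text from PDF bytes with pdfminer.six (assumed installed)."""
    from pdfminer.high_level import extract_text
    return extract_text(BytesIO(data))


def first_nonempty_text(data, extractors):
    """Try each extractor in order.

    Returns (extractor_name, text) for the first extractor that yields
    non-whitespace text, or (None, "") if none of them does.
    """
    for extractor in extractors:
        try:
            text = extractor(data)
        except Exception:
            # A library may raise on a PDF it cannot parse; try the next one.
            continue
        if text and text.strip():
            return extractor.__name__, text
    return None, ""
```

With the PDF bytes downloaded into a variable `data`, calling `first_nonempty_text(data, [extract_with_pypdf2, extract_with_pdfminer])` would report which library, if any, produced text for that file. It does not explain why one library fails on a given PDF; it only makes the difference easy to reproduce.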

Any idea what is wrong with the PDF, or is this a limitation of the libraries I am using?
