Extract text from a PDF converted from a webpage using PyPDF2

Problem description

I used Chrome's "Save as PDF" option to convert a webpage into a PDF. The problem is that when I extract text from it using PyPDF2, it shows nothing (Null), whereas the same code works on other PDF files easily. I know that I can extract the data directly from the website, but I want to understand why this is not working. It reports the correct number of pages, but when I call extractText(), it shows nothing. Does anyone know what the problem is? The link to the page is https://en.wikipedia.org/wiki/Rapping. I converted this webpage to PDF.

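import PyPDF2

# Legacy PyPDF2 names (PdfFileReader, numPages, getPage, extractText), as used in the question.
with open('C:/Users/System/Desktop/Rapping - Wikipedia.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)       # the page count is reported correctly
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())    # prints nothing for this PDF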

Recommended answer

PyPDF2 is highly unreliable for extracting text from PDFs, as is also pointed out here. It says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

You could instead install and use pdfminer using

pip install pdfminer

or you can use another open-source utility named pdftotext, by xpdfreader. Instructions for using the utility are given on its page.
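For the pdfminer route mentioned above, a minimal sketch might look like the following. It assumes the maintained pdfminer.six fork is installed (pip install pdfminer.six), which provides pdfminer.high_level.extract_text; the helper name extract_pdf_text and its extractor parameter are hypothetical and only there for illustration.

```python
from pathlib import Path


def extract_pdf_text(pdf_path, extractor=None):
    """Return the text of the PDF at pdf_path (hypothetical helper).

    extractor is any callable that takes a file path and returns a string.
    When it is omitted, pdfminer.six's extract_text is used.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(path)
    if extractor is None:
        # Assumption: pdfminer.six is installed; it is imported lazily so
        # the helper can also be used with a different extractor.
        from pdfminer.high_level import extract_text
        extractor = extract_text
    return extractor(str(path))
```

Calling extract_pdf_text('C:/Users/System/Desktop/Rapping - Wikipedia.pdf') should then return the text of the whole document as one string, provided the PDF actually contains a text layer.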

You can download the command-line tools from here and run the pdftotext.exe utility using subprocess. A detailed explanation of using subprocess is given here.
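Calling pdftotext through subprocess could be sketched roughly as below. This assumes the pdftotext executable is on the PATH (otherwise pass the full path to pdftotext.exe as exe) and relies on pdftotext's convention that "-" as the output file sends the text to stdout. The function names are hypothetical.

```python
import subprocess


def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the argument list for pdftotext (hypothetical helper)."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" asks
    # pdftotext to write the extracted text to stdout instead of a file.
    return [exe, "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on pdf_path and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, exe),
        capture_output=True,  # collect stdout and stderr
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces, such as 'Rapping - Wikipedia.pdf'.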
