Python - Extracting text from webpage PDF

Question

I have come across a few posts about converting PDFs to HTML or to text, but they all work from a file saved on the computer. Is there a way to extract the text from a PDF served at a URL without saving the PDF file itself? I will be doing this for a large number of files by iterating through a list of URLs.

I am also curious which library is best for this: pdfkit, pdf2txt, pdfminer, or something else?

Here is an example website with the format I will be dealing with: http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf

Recommended answer

You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:

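import io

import requests
from pypdf import PdfReader  # maintained successor of pyPdf, which does not run on Python 3

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

# Fetch the PDF over HTTP; the bytes stay in memory
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error
f = io.BytesIO(r.content)  # wrap the bytes in a file-like object

# Parse the in-memory PDF and split the first page's text into lines
reader = PdfReader(f)
contents = reader.pages[0].extract_text().split('\n')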

f is a file-like object that you can use just as if you had opened a PDF file. This way the file exists only in memory and is never saved locally.

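To get the text out of the PDF, you can use a library from the PyPdf family; the original pyPdf package is no longer maintained, and pypdf is its current successor.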
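Since the question mentions iterating through a list of URLs, here is a minimal sketch of how the same in-memory approach might be applied to many files and to every page of each file. The function names (fetch_pdf_bytes, extract_page_texts, texts_for_urls) and the 30-second timeout are assumptions made for illustration, not part of the original answer, and the sketch assumes the requests and pypdf packages are installed.

```python
import io
from typing import Callable, Dict, Iterable, List


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a PDF into memory; nothing is written to disk."""
    import requests  # third-party; imported here so the loop can be tried offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


def extract_page_texts(pdf_bytes: bytes) -> List[str]:
    """Return the extracted text of every page, in page order."""
    from pypdf import PdfReader  # third-party successor of pyPdf

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return [page.extract_text() or "" for page in reader.pages]


def texts_for_urls(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_pdf_bytes,
    extract: Callable[[bytes], List[str]] = extract_page_texts,
) -> Dict[str, List[str]]:
    """Map each URL to its list of page texts, skipping URLs that fail."""
    results: Dict[str, List[str]] = {}
    for url in urls:
        try:
            results[url] = extract(fetch(url))
        except Exception as exc:  # one bad URL should not stop the whole run
            print(f"skipping {url}: {exc}")
    return results
```

Calling texts_for_urls([url]) with the url from the example above would return a dictionary with one entry whose value is a list of page texts. The fetch and extract parameters are there only so that either step can be swapped out, for example to use a different PDF library.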
