Extracting text from PDF in Python

Views: 420 | Published: 2020/5/25 5:09:41 | Tags: python, pdf, pypdf2

Question

I have a PDF full of quotes:

https://www.pdf-archive.com/2017/03/22/test/

I can extract the text in Python using the following code:

import PyPDF2

# Open the PDF in binary mode and read the text of the first page.
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

This returns all the quotes as one paragraph. Is it possible to "split" the PDF at the horizontal separators, so that each quote comes out separately?

Recommended answer

If you just want to extract the quotes from the PDF text, you can use a regular expression to find every quoted passage.

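import PyPDF2
import re

# PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names.
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = str(pageObj.extractText())

# Match each span enclosed in straight double quotes.
quotes = re.findall(r'"[^"]*"', text)
for quote in quotes:
    print(quote)
    print()  # blank line between quotes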

Or simply:

quotes = re.findall(r'"[^"]*"', text)
print(quotes)
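
The snippets above use the older camelCase PyPDF2 names. As a minimal sketch, assuming a PyPDF2 release that provides `PdfReader`, `reader.pages` and `extract_text()`, the same regex approach could be wrapped in two small helpers. The function names `extract_page_text` and `find_quotes` are made up for illustration and do not come from the original answer; the pattern is the one the answer uses, so it only matches straight double quotes.

```python
import re


def extract_page_text(pdf_path, page_index=0):
    """Return the text of one page (hypothetical helper, not from the answer)."""
    # Imported inside the function so that find_quotes() below can be used
    # even when PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to an empty string if no text could be extracted.
    return reader.pages[page_index].extract_text() or ""


def find_quotes(text):
    """Return every span enclosed in straight double quotes, marks included."""
    return re.findall(r'"[^"]*"', text)
```

With the file from the answer, `find_quotes(extract_page_text('test.pdf'))` would return the list of quoted passages on the first page, provided the extracted text actually contains straight `"` characters rather than typographic ones.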
