How to extract text from a specific area in a PDF using Python?

Question

I'm trying to extract text from a PDF using Python, and I have successfully done so using PyPDF2 like this:

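import PyPDF2

# Legacy PyPDF2 API; 'path' is a placeholder for the PDF file path.
with open('path', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)  # first page
    print(pageObj.extractText())    # all text on the page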

This extracts all the text from the page, but I want to extract the text only from a rectangular region of 3'x4' at the top-left part of the page.

I basically want to do something like "How-to extract text from a pdf doc within a specific rectangular region?", but in Python.

Can this be done with PyPDF2 or with any other Python library?

Recommended answer

This is a rather complex topic, but it is possible. First you need to get familiar with the PDF format description.

For example, you can start here.

You can identify the location and contents of the text boxes and extract the string data.
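As a rough illustration of that idea, here is a minimal sketch that turns "the top-left 3 by 4 region" into a rectangle and keeps only the text fragments positioned inside it. It assumes the 3'x4' in the question means 3 by 4 inches, that the page uses the default PDF user space (1/72 inch units, origin at the bottom-left corner of the page), and that the text positions have already been obtained somehow as (x, y, text) tuples. The names top_left_region and text_in_region are hypothetical helpers, not part of any library.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def top_left_region(page_width, page_height, width_in, height_in):
    """Return (x0, y0, x1, y1) in points for a region at the page's top-left.

    Assumes the page origin is the bottom-left corner, so the top edge of
    the page sits at y == page_height.
    """
    x0 = 0.0
    x1 = min(width_in * POINTS_PER_INCH, page_width)
    y1 = page_height
    y0 = max(page_height - height_in * POINTS_PER_INCH, 0)
    return (x0, y0, x1, y1)


def text_in_region(fragments, region):
    """Keep the text of fragments whose (x, y) origin lies inside region."""
    x0, y0, x1, y1 = region
    return [text for x, y, text in fragments
            if x0 <= x <= x1 and y0 <= y <= y1]


# Usage with made-up fragments on a US Letter page (612 x 792 points).
region = top_left_region(612, 792, width_in=3, height_in=4)
fragments = [(72, 720, "Invoice"), (400, 720, "Page 1"), (72, 100, "Footer")]
print(text_in_region(fragments, region))
```

Only the starting point of each fragment is tested here, so a long string that starts inside the rectangle and runs past its right edge would still be kept in full.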

This topic has examples for pyPdf, the predecessor of PyPDF2, but the syntax is similar. There are examples of how to iterate through the indirect objects.

Another good place to start is the source of the pageObj.extractText() function that you used.

If you are not restricted to Python: How to extract text from a PDF?

You can also use a tool like iText RUPS to inspect the PDF. It shows how the content is rendered and placed on the page.
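To give a feel for what such an inspection shows, here is a simplified sketch that reads an already decoded page content stream and reports where each string is placed. It is only an approximation under several assumptions: text scaling is the identity, only the BT, Td, Tm and Tj operators are handled (TJ arrays, T*, hex strings and nested parentheses are ignored), and the returned text is raw, with no font encoding applied. The function name text_positions and the sample stream are made up for illustration.

```python
import re

# Tokens of a decoded content stream: literal strings, names, numbers, operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"   # literal string such as (Hello)
    r"|/[^\s/()\[\]<>]+"      # name such as /F1
    r"|[-+]?\d*\.?\d+"        # number such as 72 or -14.5
    r"|[A-Za-z*]+"            # operator such as BT, Td, Tj
)


def text_positions(stream):
    """Return (x, y, raw_text) for each Tj operator in the stream."""
    operands = []
    results = []
    x = y = 0.0  # start of the current text line
    for token in TOKEN.findall(stream):
        if token[0] in "(/+-.0123456789":
            operands.append(token)  # operands precede their operator
            continue
        if token == "BT":
            x = y = 0.0  # a new text object resets the text position
        elif token == "Td" and len(operands) >= 2:
            x += float(operands[-2])  # move relative to the line start
            y += float(operands[-1])
        elif token == "Tm" and len(operands) >= 6:
            x, y = float(operands[-2]), float(operands[-1])  # absolute position
        elif token == "Tj" and operands:
            results.append((x, y, operands[-1][1:-1]))  # strip the parentheses
        operands = []  # every operator consumes its operands
    return results


sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"
print(text_positions(sample))
```

Content streams in real files are usually compressed, so they have to be decoded first, and a second Tj on the same line is reported at the line start rather than at the advanced glyph position.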

Afterwards you should be able to identify and address the elements and extract their content.
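If another Python library is acceptable, a shorter route may be to let the library do the position filtering. The sketch below uses PyMuPDF (imported as fitz), which is not what this answer describes; it is just one possible alternative. It assumes the region is 3 by 4 inches, that PyMuPDF is installed, and that 'path' is a placeholder for your file. The helper names top_left_clip and extract_top_left are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF points per inch


def top_left_clip(width_in, height_in):
    """Return (x0, y0, x1, y1) in points for a clip at the page's top-left.

    PyMuPDF page coordinates start at the top-left corner with y growing
    downward, so the region simply starts at (0, 0).
    """
    return (0, 0, width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)


def extract_top_left(path, width_in=3, height_in=4, page_number=0):
    """Extract the text inside the top-left region of one page."""
    import fitz  # PyMuPDF; imported here so top_left_clip works without it

    with fitz.open(path) as doc:
        page = doc[page_number]
        clip = fitz.Rect(*top_left_clip(width_in, height_in))
        return page.get_text("text", clip=clip)


# Usage (requires PyMuPDF and a real file):
# print(extract_top_left('path'))
```

Note the coordinate convention: here the rectangle starts at (0, 0) because y grows downward, whereas in raw PDF user space the same region would sit at the top of the y range.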
