PDF Parsing Using Python - extracting formatted and plain texts


Question

I'm looking for a PDF library that will let me extract the text from a PDF document. I've looked at PyPDF, which extracts the text from a PDF document very nicely. The problem is that if the document contains tables, the text in the tables is extracted inline with the rest of the document text. This produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together).

I'd like to extract the text from a PDF document, excluding any tables and special formatting. Is there a library that does this?

Recommended answer

You can also take a look at PDFMiner (for older versions of Python, see the earlier PDFMiner releases).

A particularly interesting feature of PDFMiner is that you can control how it regroups text parts when extracting them. You do this by specifying the spacing between lines, words, characters, etc. So by tweaking these settings you may be able to achieve what you want (that depends on the variability of your documents). PDFMiner can also give you the location of the text on the page, and it can extract data by object ID, among other things. So dig into PDFMiner and be creative!
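As a rough illustration of the grouping controls and position data mentioned above, here is a minimal sketch. It assumes the pdfminer.six fork (installed with `pip install pdfminer.six`), whose `LAParams` class exposes `char_margin`, `word_margin`, and `line_margin`. The helper names `digit_ratio` and `extract_prose_blocks`, and the idea of dropping number-heavy blocks as probable table content, are my own assumptions for illustration and not part of PDFMiner. The threshold will likely need tuning per document.

```python
def digit_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


def extract_prose_blocks(pdf_path, max_digit_ratio=0.5,
                         char_margin=2.0, word_margin=0.1, line_margin=0.5):
    """Return (page_number, bbox, text) for text blocks that are not number-heavy.

    The digit-ratio filter is a hypothetical heuristic for skipping table-like
    blocks; it is not a PDFMiner feature.
    """
    # Imported lazily so digit_ratio() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # These margins control how characters are merged into words, lines, and boxes.
    laparams = LAParams(char_margin=char_margin,
                        word_margin=word_margin,
                        line_margin=line_margin)
    blocks = []
    for page_number, page in enumerate(extract_pages(pdf_path, laparams=laparams), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                text = element.get_text().strip()
                if text and digit_ratio(text) <= max_digit_ratio:
                    # element.bbox is (x0, y0, x1, y1) in page coordinates.
                    blocks.append((page_number, element.bbox, text))
    return blocks
```

Calling `extract_prose_blocks("report.pdf")` (the file name is a placeholder) would return the surviving blocks with their bounding boxes, which can then be filtered further by position if the tables always sit in a known region of the page.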

But your problem is really not an easy one to solve because, in a PDF, the text is not continuous; it is made up of many small groups of characters positioned absolutely on the page. The focus of PDF is to keep the layout intact. It is presentation-oriented, not content-oriented.
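To make the point about absolutely positioned fragments concrete, here is a small self-contained sketch of the kind of regrouping an extractor has to do. It is a simplified illustration, not how PDFMiner is implemented: the `Fragment` tuple, the `fragments_to_lines` function, and the `y_tolerance` parameter are hypothetical names, and real documents also need handling for columns, rotated text, and varying font sizes.

```python
from typing import List, Tuple

# (x, y, text): position of a fragment on the page. As in PDF page space,
# y grows upward, so a larger y means higher on the page.
Fragment = Tuple[float, float, str]


def fragments_to_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group positioned fragments into reading-order lines of text."""
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # Visit fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat as part of the current line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments left to right and join them.
    return [" ".join(t for _, t in sorted(parts, key=lambda p: p[0]))
            for _, parts in lines]


fragments = [
    (120.0, 700.0, "world"),       # stored out of reading order
    (72.0, 700.5, "Hello"),
    (72.0, 680.0, "Second line"),
]
print(fragments_to_lines(fragments))  # ['Hello world', 'Second line']
```

The fragments in the example are deliberately stored out of reading order, since nothing in the page description guarantees that order. A table cell and a sentence of body text that happen to share a baseline would be merged by this logic, which is exactly how table content ends up inline with prose.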
