PDF Parsing Using Python - extracting formatted and plain texts

Question

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together).

I'd like to extract the text from a PDF document, excluding any tables and special formatting. Is there a library out there that does this?

Recommended answer

You can also take a look at PDFMiner (or for older versions of Python see PDFMiner).

A particular feature of interest in PDFMiner is that you can control how it regroups text parts when extracting them. You do this by specifying the spacing between lines, words, characters, and so on, so tweaking these values may get you what you want (how well this works depends on how variable your documents are). PDFMiner can also give you the location of text on the page, and it can extract data by object ID, among other things. So dig into PDFMiner and be creative!
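As a rough illustration of the spacing controls and text positions mentioned above, here is a minimal sketch. It assumes the maintained `pdfminer.six` fork (its `extract_pages`, `LAParams`, and `LTTextContainer`), which may differ from the PDFMiner version you have installed. The `digit_ratio` heuristic, the `extract_prose_blocks` name, and the 0.3 threshold are hypothetical choices for this example, not part of PDFMiner, and the margin values shown are just starting points to tune.

```python
def digit_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


def extract_prose_blocks(pdf_path: str, max_digit_ratio: float = 0.3):
    """Yield (page_number, bbox, text) for text blocks that look like prose.

    Blocks dominated by digits are skipped as a crude guess at table content.
    """
    # Imported lazily so digit_ratio() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # These margins control how characters are grouped into words, lines,
    # and text boxes; adjust them for your documents.
    laparams = LAParams(char_margin=2.0, word_margin=0.1, line_margin=0.5)

    pages = extract_pages(pdf_path, laparams=laparams)
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                text = element.get_text().strip()
                if text and digit_ratio(text) <= max_digit_ratio:
                    # element.bbox is (x0, y0, x1, y1) in page coordinates.
                    yield page_number, element.bbox, text
```

Calling `list(extract_prose_blocks("report.pdf"))` (with a placeholder file name) would return the surviving blocks along with their positions. Whether a digit-based filter is good enough depends on how your tables look; position-based filtering using `bbox` is another option.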

But your problem is really not an easy one to solve because, in a PDF, the text is not continuous but made up of many small groups of characters positioned absolutely on the page. The focus of PDF is to keep the layout intact. It's not content-oriented but presentation-oriented.
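To make the point about absolutely positioned fragments concrete, here is a toy sketch in plain Python. It is not how PDFMiner works internally; the `Fragment` tuple, the `group_into_lines` name, and the `line_tolerance` parameter are assumptions made up for this illustration. It shows why an extractor has to guess which fragments belong on the same line.

```python
from typing import List, Tuple

# (x, y, text): y grows upward, as in PDF page coordinates.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Rebuild reading-order lines from absolutely positioned fragments."""
    lines: List[List[Fragment]] = []
    # Visit fragments from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same line if the baseline is close to the current line's first fragment.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    ]


fragments = [
    (120.0, 700.0, "world"),
    (72.0, 700.5, "Hello"),
    (72.0, 680.0, "Second"),
    (110.0, 680.0, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'Second line']
```

A table breaks this kind of grouping because cells from different columns share a baseline, so they get joined into one line of mashed-together values.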
