If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?


Question

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.

It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:


All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace

Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.

However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.

So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?

EDIT 31 March 2014: For what it's worth, I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation), and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google: one can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10-page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
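The split-and-stitch utility described above can be sketched around a small pure helper that computes the 10-page ranges; the trailing comment shows how each range might become its own file with a library such as pypdf (the pypdf wiring is my assumption, not part of the original post):

```python
def page_chunks(n_pages, size=10):
    """Split page indices 0..n_pages-1 into consecutive runs of at most
    `size` pages, mirroring the 10-page conversion limit described above."""
    return [list(range(start, min(start + size, n_pages)))
            for start in range(0, n_pages, size)]

# Hypothetical wiring with a PDF library such as pypdf:
#   from pypdf import PdfReader, PdfWriter
#   reader = PdfReader("big.pdf")
#   for i, chunk in enumerate(page_chunks(len(reader.pages))):
#       writer = PdfWriter()
#       for page_index in chunk:
#           writer.add_page(reader.pages[page_index])
#       with open(f"part_{i:03d}.pdf", "wb") as fh:
#           writer.write(fh)
```

Stitching the downloaded text files back together is then just concatenation in the same chunk order.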

EDIT 7 April 2014: Cancel my last. The best extraction is achieved by MS Word, and this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this matters less once the output is converted to plain text.
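For what it's worth, the docx-to-txt step does not strictly need a library at all: a .docx file is a zip archive whose main part is word/document.xml, so the text can be pulled out with standard-library tools alone. A minimal sketch (it ignores tables, headers, and footnotes, and the paragraph handling is simplified):

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(docx_file):
    """Extract plain text from a .docx: read word/document.xml out of the
    zip container and join the <w:t> runs of each <w:p> paragraph."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return "\n".join(paragraphs)
```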

Answer

I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.

You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.

So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
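A toy version of such a whitespace-driven engine, assuming each extracted fragment arrives with page coordinates (x, y, text) and y growing downwards, might cluster fragments into columns wherever the horizontal gap is large, then read each column top to bottom:

```python
def reading_order(fragments, column_gap=50):
    """Group positioned text fragments into columns by x position, then
    emit them top-to-bottom within each column, leftmost column first.

    fragments: list of (x, y, text) tuples with y increasing downwards.
    """
    # 1. Cluster the distinct x positions into columns: sort them and
    #    start a new column wherever the gap exceeds the threshold
    #    (a crude stand-in for real whitespace analysis).
    xs = sorted({x for x, _, _ in fragments})
    columns, current = [], [xs[0]]
    for x in xs[1:]:
        if x - current[-1] > column_gap:
            columns.append(current)
            current = [x]
        else:
            current.append(x)
    columns.append(current)

    # 2. Assign each fragment to its column, then sort within the
    #    column by y (then x) to recover reading order.
    out = []
    for col in columns:
        members = [f for f in fragments if f[0] in col]
        members.sort(key=lambda f: (f[1], f[0]))
        out.extend(text for _, _, text in members)
    return out
```

Real engines refine this with font metrics, line-art information, and per-page statistics; the fixed column_gap threshold here is the crudest possible form of whitespace analysis.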

Alternative strategies could be:


  • Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
  • Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
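The second strategy can be as simple as dispatching on the document's /Producer or /Creator metadata string; the mapping and strategy names below are purely illustrative placeholders, not something from the original answer:

```python
def pick_strategy(producer, strategies, default):
    """Return the first extraction strategy whose key occurs in the
    document's /Producer string (case-insensitive); else the default."""
    text = (producer or "").lower()
    for key, strategy in strategies.items():
        if key in text:
            return strategy
    return default

# Illustrative mapping from producer substrings to engine names:
word_aware = {"microsoft word": "word_layout", "libreoffice": "odf_layout"}
```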

So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.

Additionally (and this may seem like a lousy answer, but I believe it's actually true), this is an algorithmically difficult problem, and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where, if you are clever, you have enough focus to put some of your attention on it, and especially if you have a good idea what target market you are writing this for, you'll get it right while everybody else gets it mediocre.

(And no, I didn't get it right back then when I was writing that code; we never had enough focus to follow through and make something that was really good.)
