How is hidden text stored in OCR-enhanced PDF files?

Views: 114 · Posted: 2020/5/19 19:25:49 · Tags: pdf, ocr


Question

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata

I'm looking for some details about PDF files. What matters most to me is that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem possible with Adobe Acrobat...).

To that end I've been looking at different ways to OCR my PDF files. I found three candidates that seem to do what they should (more or less). All three have their pros and cons, but they also seem to take different approaches to storing the data in the PDF file. Let me explain:

A file OCRed with Adobe Acrobat:

https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf

results in a file that Acrobat can open in one step (no preloading of any background layer), and after running a preflight script I can see the text that is stored hidden:

A file OCRed with ABBYY FineReader:

https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf

does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:

But as far as I was able to reproduce, these files seem to have a background text layer that contains the OCRed text and lies underneath the image that is ultimately shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...

A file OCRed with Tesseract 4 (alpha):

https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf

is also doing some weird magic with the hidden text part:

But in all three cases I can search for words in the files and see the text by using "Remove hidden information" and selecting "hidden text":

I'm seriously confused... Does anyone know how these programs actually store their hidden text information?

P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
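As a possible starting point for investigating the question above outside of Acrobat: the PDF format defines a text rendering mode operator, `Tr`, and mode 3 means the text is neither filled nor stroked, i.e. invisible but still searchable. The sketch below scans an already-decompressed page content stream for that operator. It is only a rough heuristic under stated assumptions: the function names are made up for illustration, the stream must be decoded beforehand with some other tool, the regex can be fooled by string literals that happen to contain `Tr`, and it will not detect normally rendered text that is simply painted underneath an image. It makes no claim about which of the three programs uses which approach.

```python
import re

# Matches the PDF text rendering mode operator with a single-digit operand,
# e.g. "3 Tr". Heuristic only: it does not tokenize the stream properly.
TR_OPERATOR = re.compile(rb"(?<![\w.])(\d)\s+Tr(?![A-Za-z])")


def text_rendering_modes(content_stream: bytes) -> list[int]:
    """Return every text rendering mode set in the stream, in order.

    `content_stream` is assumed to be a page content stream that has
    already been decompressed (hypothetical input for this sketch).
    """
    return [int(m.group(1)) for m in TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: bytes) -> bool:
    """True if the stream switches to rendering mode 3 (invisible text)."""
    return 3 in text_rendering_modes(content_stream)


# Hand-written sample: an image is drawn, then text in mode 3.
sample = (
    b"q 595 0 0 842 0 0 cm /Im1 Do Q\n"
    b"BT 3 Tr /F1 12 Tf 72 700 Td (hello) Tj ET"
)

print(text_rendering_modes(sample))
print(has_invisible_text(sample))
```

Running this kind of check against the decoded content streams of the three sample files would show whether each one relies on rendering mode 3 or on some other mechanism.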
