PDF: How can I override/fix searchable text in a scanned image + OCR file?


Question

I'm trying to create an index for a PDF file that I scanned as images from an old original manuscript and then ran through character recognition in Adobe Acrobat Pro. The problem is that some of the words were oddly spaced, so the OCR ended up with flaws. I used the Find and Fix Suspects tool, but problems remain.

Case in point: the text "FOR EXAMPLE" was oddly spaced in the original document (and, of course, in its image), so Adobe reads it as three words, "FOR EX AMPLE". That produces an index entry for the word "ample", which would look perfectly valid if I did not know better. This is one of several similar problems I have found in the document so far (and there are still more pages to proofread).

How can I fix the underlying OCR text so that it contains the correct information both in the created index and when searching the document?

PS: I cannot just switch to a pure OCR text version of the document, since the manuscript is technical and has lots of drawings associated with the text. I need to keep the images and alter the "hidden" searchable text underneath.

Recommended answer

I found this answer suggesting ABBYY FineReader 14 (commercial; I am not affiliated). It looks like it will handle the editing, after which I presume your existing workflow would take care of the indexing. Here is another answer giving some more workflow details (albeit from three years ago).

Separately, this question has answers suggesting Perl's CAM::PDF and pdftk.
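Whichever tool ends up editing the hidden text, it may help to first list likely split-word errors such as "EX AMPLE" so the remaining pages can be proofread faster. The following is only a minimal sketch in Python, not part of any of the tools named above. It assumes the OCR text of a page has already been exported as a plain string, and the function name `find_split_words` and the tiny word list are hypothetical placeholders for illustration.

```python
import re


def find_split_words(text, vocabulary, min_fragment_len=2):
    """Flag adjacent tokens that look like one word split by the OCR.

    A pair (left, right) is reported when left+right is a known word
    and at least one of the two fragments is NOT a known word.
    """
    vocab = {w.lower() for w in vocabulary}
    tokens = re.findall(r"[A-Za-z]+", text)
    suspects = []
    for left, right in zip(tokens, tokens[1:]):
        # Skip very short fragments to reduce noise from single letters.
        if len(left) < min_fragment_len or len(right) < min_fragment_len:
            continue
        joined = (left + right).lower()
        if joined in vocab and (left.lower() not in vocab
                                or right.lower() not in vocab):
            suspects.append((left, right, left + right))
    return suspects


# Hypothetical word list; a real run would load a full dictionary
# plus the manuscript's own technical terms.
vocabulary = ["for", "example", "ample", "the", "drawing", "shows"]
page_text = "FOR EX AMPLE the drawing shows"

print(find_split_words(page_text, vocabulary))
# prints [('EX', 'AMPLE', 'EXAMPLE')]
```

Note that "ample" alone is a valid word, which is exactly why it slips into the index unnoticed; the pair is caught here only because "EX" is not in the word list. Splits where both halves are real words would not be flagged by this check and would still need manual proofreading.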
