How to know if a PDF contains only images or has been OCR scanned for searching?

Question

I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?

I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.

Recommended answer

Scanned images that were converted to PDF and later OCR'ed to make their text searchable normally contain the recognized text rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image, but when a search succeeds, the highlighted hits are on the invisible text.

I'd recommend looking at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html

Example usage of pdffonts:

C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]

This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).

C:\downloads\> pdffonts examle1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no   14    0
Arial                                TrueType          no  no  no   15    0

This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.

C:\downloads\> pdffonts examle2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

This PDF does not use a single font, and hence has no text embedded (so no OCR either).
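For batch processing, the same check can be scripted. The following is only a minimal sketch in Python, assuming pdffonts is on the PATH and prints the table layout shown above (two header lines, then one row per font). The helper names parse_font_names and pdf_has_fonts are made up for illustration. Note that a non-empty font list only tells you the file contains some text; it does not prove that the text came from OCR.

```python
import subprocess


def parse_font_names(pdffonts_output: str) -> list:
    """Return the font names listed in the text printed by pdffonts.

    Assumes the layout shown above: a column header line, a dashed
    separator line, then one row per font.
    """
    names = []
    past_header = False
    for line in pdffonts_output.splitlines():
        if not past_header:
            # Font rows start after the dashed separator line.
            if line.startswith("----"):
                past_header = True
            continue
        if line.strip():
            # The first whitespace-separated token is the font name.
            names.append(line.split()[0])
    return names


def pdf_has_fonts(pdf_path: str) -> bool:
    """Run pdffonts on pdf_path (assumed to be on PATH) and report
    whether at least one font is listed."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        check=True,
    )
    output = result.stdout.decode("utf-8", errors="replace")
    return bool(parse_font_names(output))
```

Files for which pdf_has_fonts returns False are the candidates for an OCR pass; the same idea carries over to C# or VB.NET by starting the tool as a child process and reading its standard output.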

Example usage of pdftotext:

C:\downloads\> pdftotext ^
                   -layout ^
                   cisco-ip-phone-7911-guide6.1.pdf ^
                   cisco-ip-phone-7911-guide6.1.txt

This will extract all text strings from the PDF (trying to preserve some resemblance to the original layout). If there is no text in the PDF, you'd know there was no OCR...
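The "is there any text at all" test can be automated as well. This is a rough sketch in Python, assuming pdftotext is on the PATH; it passes "-" as the output file name so the text goes to standard output instead of a .txt file. The helper names and the min_chars threshold are assumptions chosen for illustration, not part of the tool.

```python
import subprocess


def has_extractable_text(extracted: str, min_chars: int = 1) -> bool:
    """Report whether the extracted text holds at least min_chars
    non-whitespace characters.

    pdftotext separates pages with form feeds, so an image-only PDF
    can still produce output made up purely of whitespace.
    """
    # split() with no arguments also treats form feeds as whitespace.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Decode leniently, since the output encoding depends on the tool's settings.
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as has_extractable_text(extract_text("scan.pdf")) would then flag files that still need OCR. Raising min_chars may help when a scan carries only a stray stamp or page number as real text.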
