How do I find all image-based PDFs?


Problem description

I have many PDF documents on my system, and I sometimes notice that a document is image-based, with no editable text. In that case I run OCR in Foxit PhantomPDF, which can OCR multiple files at once, so that the document becomes searchable. I would like to find all of my PDF documents that are image-based.

I do not understand how a PDF reader recognizes that a document contains no text and needs OCR. There must be some fields that these readers access, and those should be accessible from the terminal too. An answer in the thread "Check if a PDF file is a scanned one" makes some open-ended suggestions:

Your best bet might be to check to see if it has text and also see if it contains a large pagesized image or lots of tiled images which cover the page. If you also check the metadata this should cover most options.
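
The first part of that advice (check whether the file has any text) could be sketched as follows. This is only a rough illustration, assuming the pdftotext tool from poppler-utils is installed; the function names and the MIN_CHARS threshold are hypothetical, not something the quoted answer specifies. It does not cover the page-sized-image or metadata checks.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a PDF looks image-only by how much text
# pdftotext can extract from it. Names and threshold are illustrative.

# Count the non-whitespace bytes arriving on stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed (exit status 0) when the PDF given as "$1" yields fewer than
# MIN_CHARS non-whitespace bytes of text. MIN_CHARS is a hypothetical
# knob and defaults to 1, i.e. "no text at all".
# Caveat: if pdftotext is missing or fails, the count is 0 and the
# file is also treated as image-only.
looks_image_only() {
    local min_chars="${MIN_CHARS:-1}"
    local n
    n=$(pdftotext "$1" - 2>/dev/null | count_text_chars)
    [ "${n:-0}" -lt "$min_chars" ]
}
```

For example, `looks_image_only scan.pdf && echo "needs OCR"` would flag a file from which no text can be extracted. A scanned PDF that has already been through OCR carries a text layer, so it would not be flagged.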

I would like to understand better how to do this efficiently. If some metadata field existed for this, it would be easy, but I have not found one. I think the most probable approach is to check whether a page contains a page-sized image with an OCR text layer for search, because that is effective and some PDF readers already do it. However, I do not know how to do that.

Edge detection and Hough transform in relation to the answer

In the Hough transform, parameters are chosen from a hypercube in the parameter space. Its complexity is $O(A^{m-2})$, where $m$ is the number of parameters and $A$ is the size of the image space, so with more than three parameters the problem becomes hard. Foxit Reader most probably uses 3 parameters in its implementation. Edges are easy to detect reliably, which keeps the approach efficient, and edge detection must be done before the Hough transform. Corrupted pages are simply ignored. The other two parameters are still unknown, but I think they must be nodes and some intersections. How these intersections are computed is unknown, and so is the exact formulation of the problem.

Testing Deajan's answer

The command works on Debian 8.5, but I could not initially get it to work on Ubuntu 16.04:

masi@masi:~$ find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'
./Downloads/596P.pdf
./Downloads/20160406115732.pdf
^C

OS: Debian 8.5 64 bit
Linux kernel: 4.6 of backports
Hardware: Asus Zenbook UX303UA

Solution

Late to the party, but here is a simple solution. It assumes that PDF files which already contain fonts are not purely image-based:

find ./ -name "*.pdf" -print0 | xargs -0 -I {} \
    bash -c 'export file="{}"; \
    if [ $(pdffonts "$file" 2> /dev/null | \
    wc -l) -lt 3 ]; then echo "$file"; fi'

  • pdffonts lists the fonts used in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes 2 lines, so any output shorter than 3 lines means no fonts were listed. AFAIK there shouldn't be false positives, although that is more a question for the pdffonts developers.

As a one-liner:

find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'

Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that don't contain text.
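
The same idea can be written a little more defensively. The sketch below is a variant, not the original command: it passes each filename as data rather than substituting it into the bash -c string (so names containing quotes or `$` are handled), and it stops early if pdffonts is not installed. That guard matters because the command above discards errors, so a missing pdffonts yields zero lines and every PDF gets reported. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: font-count test with the filename handled as data.
# Function names are illustrative, not part of pdffonts or pmOCR.

# Count font rows in pdffonts output read from stdin,
# skipping the 2 header lines and any blank lines.
count_font_rows() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Print every PDF under directory "$1" (default: current directory)
# for which pdffonts lists no fonts.
list_fontless_pdfs() {
    # Guard: without pdffonts the output would be empty for every file,
    # and all PDFs would be reported as image-based.
    command -v pdffonts >/dev/null 2>&1 || {
        echo "pdffonts not found (on Debian/Ubuntu it is in poppler-utils)" >&2
        return 1
    }
    find "${1:-.}" -type f -iname '*.pdf' -print0 |
        while IFS= read -r -d '' file; do
            rows=$(pdffonts "$file" 2>/dev/null | count_font_rows)
            # As in the original, a PDF that pdffonts cannot read
            # also produces 0 rows and is reported.
            [ "$rows" -eq 0 ] && printf '%s\n' "$file"
        done
    return 0
}
```

Calling `list_fontless_pdfs ~/Documents` would then print one candidate per line.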

My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.
