Batch OCRing PDFs that haven't already been OCR'd

Question

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and only OCR the pages that haven't already been done?

Recommended answer

This is exactly what I was looking for. I have thousands of scanned PDF files, some of which were already OCR'ed and some of which were not.

So, I combined information I found on fora and Stack Overflow, and made my own solution that does EXACTLY that, which I have summarized for you here:

  • scan through all subdirectories recursively for PDF files;
  • check if the PDF was already OCR'ed, and if not, process the PDF with OCR with high quality, in the language(s) you can specify;
  • save the OCR'ed PDF in place as PDF/A, overwriting the old (not yet OCR'ed) file.

I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well. I spent countless hours manually checking which files already had a text-layer "under" the image.

UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem. There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.

  1. Enable the Windows subsystem for Linux in the Windows Control Panel; there are many guides. Google it. It's a couple of minutes.
  2. Install Linux from the Windows Store. Open the Windows Store, search for Ubuntu, and install. Takes around 5 minutes.
  3. Now you have the "Ubuntu app". Run it. It gives you a Linux bash shell with access to your Windows files through /mnt/c. It's magic!
  4. You need two Linux tools, pdffonts and ocrmypdf. On Ubuntu, pdffonts is part of the poppler-utils package, so install both with the command sudo apt install poppler-utils ocrmypdf. For OCR languages other than English you will typically also need the matching Tesseract language packs, for example tesseract-ocr-deu and tesseract-ocr-nld. We will use these tools to check whether a PDF lists any fonts and, if not, to OCR the PDF (see the note below).
  5. Install the very small bash script (below) to your home directory ~.
  6. Go to (cd) the directory where all your PDFs are saved. For example: /mnt/c/Users/name/OneDrive/Documents.
  7. Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;

Done!

Running this might, of course, take a long time, depending on how many PDFs you have and how many of those are not OCR'ed yet.
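Because a full run can take hours, it may be worth confirming first that both tools are actually on the PATH. The following is only a minimal sketch: missing_tools is a hypothetical helper name, not part of the original answer, and the apt package names in the comment assume Ubuntu.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every command given as an
# argument that cannot be found on PATH. Prints nothing if all exist.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Typical use before starting a long batch run:
#   missing_tools pdffonts ocrmypdf
# On Ubuntu (an assumption), anything it prints can usually be installed with:
#   sudo apt install poppler-utils ocrmypdf
```

If the call prints nothing, both commands were found and the batch run can start.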

Here is the sh-script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:

  1. Type cd ~. This will bring you to your home folder.
  2. Type pico pdf-ocr.sh. This will bring up an editor. Paste the script code below. Then press Ctrl+X, press Y, and confirm the file name with Enter. Your file is now saved.
  3. Type chmod +x pdf-ocr.sh. This will give the script permission to be run. (sudo is not needed for a file in your own home folder.)

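#!/bin/bash
# Usage: pdf-ocr.sh FILE.pdf
# List the fonts used on the first 5 pages, skipping the two header lines of pdffonts.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    # No named fonts found: treat the file as a bare scan and OCR it in place.
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -l sets the OCR languages; -s (--skip-text) leaves pages that already contain text untouched.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi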
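The decision the script makes can also be looked at in isolation. The sketch below splits the same pipeline into two functions that read pdffonts output from standard input, which makes the logic easy to try out without any PDF at hand. The names font_names and looks_unocred are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Read `pdffonts` output on stdin and print the distinct font names,
# skipping the two header lines and unnamed "[none]" entries.
font_names() {
    tail -n +3 | cut -d' ' -f1 | grep -v -x -F '[none]' | sort -u
}

# Succeed (exit status 0) when the pdffonts output on stdin lists no
# named fonts, i.e. when the script above would send the file to OCR.
looks_unocred() {
    [ -z "$(font_names)" ]
}

# Example use (requires pdffonts):
#   pdffonts -l 5 scan.pdf | looks_unocred && echo "needs OCR"
```

Filtering out "[none]" before testing for emptiness should give the same outcome as the two comparisons in the script: a file counts as not yet OCR'ed only when no named font remains.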

What does this do?

Well, the find command looks up all PDF files in the current directory, including subdirectories. It then "sends" these files one by one to the script, in which pdffonts checks whether the first five pages (-l 5) use any fonts. If they do, the file is skipped and the next one is tried. If no fonts are found, ocrmypdf does the OCR-ing. I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine, for example, that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
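To preview which files the find step would hand to the script, without modifying anything, something like the following can be used. It is a sketch rather than part of the original answer: list_pdfs is a hypothetical name, and -iname (a find extension available on Ubuntu) is swapped in for -name so that files ending in .PDF are matched as well.

```shell
#!/bin/bash
# Hypothetical dry run: list every PDF below the given directory
# (default: the current one), matching the extension case-insensitively.
list_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' | sort
}

# Example use:
#   cd /mnt/c/Users/name/OneDrive/Documents
#   list_pdfs . | wc -l     # how many files the batch run would visit
```

Paths containing spaces are listed correctly here, and the original find ... -exec ... '{}' \; form passes each path to the script as a single argument, so such files are handled there too.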

Note: I am making the assumption here that a PDF file with no embedded fonts is basically an image (a scan) inside a PDF file, and so has not been OCR'ed. I know that this might not always be accurate and/or true, but for me that is enough to determine which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files....
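If the font check turns out to be too coarse for a particular collection, one alternative heuristic is to look for extractable text instead of fonts. The sketch below assumes pdftotext (also shipped in poppler-utils) is installed. has_text is a hypothetical helper name and this check is not what the original script does.

```shell
#!/bin/bash
# Hypothetical helper: succeed when stdin contains at least one
# non-whitespace character (form feeds between pages count as whitespace).
has_text() {
    grep -q '[^[:space:]]'
}

# Example use (requires pdftotext): extract text from the first 5 pages
# to stdout and OCR the file only when nothing comes back.
#   pdftotext -l 5 "$file" - | has_text || ocrmypdf -s "$file" "$file"
```

Like the font check, this only samples the first pages, so a file whose later pages are bare scans would still be classified by its opening pages.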

I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I now have a "one click" batch processor that works. I could not find a solution for that with Windows tools.

I hope someone finds this useful. If anyone has improvements, please post them here.

Thanks.

Jos Jonkeren
