Batch OCRing PDFs that haven't already been OCR'd

Views: 108 Published: 2020/5/19 19:26:23 Tags: pdf ocr

Question

If I have 10,000 PDFs, some of which have been OCRed, and some of which have one page that has been OCRed while the rest of the pages have not, how can I go through all the PDFs and OCR only the pages that haven't already been done?

Recommended answer

Done!

Running this can, of course, take a long time, depending on how many PDFs you have and how many of them have not been OCR'ed yet.

Here is the sh script. Save it somewhere in your home folder so that it is easy to call from anywhere, like so:

Type cd ~. This will bring you to your home folder.
Type pico pdf-ocr.sh. This will bring up an editor. Paste the script code below, then press Ctrl+X and press Y. Your file is now saved.
Type sudo chmod +x pdf-ocr.sh. This will give the script permission to be run.

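#!/bin/sh
# Usage: pdf-ocr.sh FILE.pdf
# List the font names used on the first 5 pages (skipping pdffonts' two header lines).
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort -u)
# No fonts (or only "[none]") is taken to mean the file is a scan that has not been OCR'ed.
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -l sets the OCR languages; -s skips pages that already contain text.
    # Input and output are the same path, so the file is replaced in place.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi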

What does this do?

The find command looks up all PDF files in the current directory, including subdirectories. It then "sends" these files to the script, in which pdffonts checks whether there are embedded fonts. If there are, the file is skipped and the next one is tried. If no embedded fonts are found, ocrmypdf does the OCR. I found the quality of OCR from ocrmypdf very good, even better than Acrobat's. You can of course tweak the settings; for example, you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
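
The exact find command is not shown on this page, so the following is only a sketch of what such a call could look like. It assumes the script was saved as pdf-ocr.sh in your home folder; batch_ocr is a hypothetical wrapper name, and both of its arguments are placeholders.

```shell
#!/bin/sh
# batch_ocr DIR SCRIPT: run SCRIPT once for every *.pdf file under DIR,
# including subdirectories. Hypothetical wrapper, for illustration only.
batch_ocr() {
    # -type f          : regular files only
    # -name '*.pdf'    : match on the file extension (case-sensitive)
    # -exec "$2" {} \; : call the script with one file path per invocation
    find "$1" -type f -name '*.pdf' -exec "$2" {} \;
}

# Example call (assumed location of the script):
#   batch_ocr . "$HOME/pdf-ocr.sh"
```

Calling the script once per file keeps the script itself simple, because it only ever has to look at "$1".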

Note: I am assuming here that a PDF file with no embedded fonts (so basically an image (scan) in a PDF file) has not been OCR'ed. I know this might not always be accurate, but for me it is enough to decide which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files.
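
Because the font check is only a heuristic, it may help to preview its verdicts before starting a long run. The sketch below applies the same pdffonts test but only prints what would happen and never calls ocrmypdf. The helper name needs_ocr and the "would OCR" / "would skip" messages are assumptions made for this illustration.

```shell
#!/bin/sh
# needs_ocr FILE: succeed (exit status 0) when pdffonts lists no fonts on the
# first 5 pages of FILE. Hypothetical helper mirroring the check in pdf-ocr.sh.
needs_ocr() {
    fonts=$(pdffonts -l 5 "$1" 2>/dev/null | tail -n +3 | cut -d' ' -f1 | sort -u)
    [ -z "$fonts" ] || [ "$fonts" = '[none]' ]
}

# Dry run: print the verdict for each file given on the command line.
for f in "$@"; do
    if needs_ocr "$f"; then
        echo "would OCR: $f"
    else
        echo "would skip: $f"
    fi
done
```

Like the original script, this only inspects the first 5 pages, so a file whose early pages carry text is reported as skipped even if later pages are plain scans.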

I know that installing Linux under Windows is a bit more hassle, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I now have a "one click" batch processor that works. I could not find a solution for this with Windows tools.

I hope someone finds this useful. If anyone has improvements, please post them here.

Thanks.

Jos Jonkeren
