Tesseract OCR large number of files


Question

I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive, and I need to extract the text from them. If I run tesseract as a cron job, I get 500 to 600 files per hour at most. Can anyone suggest strategies so that I can get at least 500 per minute?

Update:

Below is my code after implementing the suggestions given by @Mark. I still don't seem to get beyond 20 files per minute.

#!/bin/bash

cd /mnt/ramdisk/input

# $1 = input file name, $2 = the same name without its extension
function tess()
{
    # Skip files that already have an output text file
    if [ -f "/mnt/ramdisk/output/$2.txt" ]
        then
        echo "skipping $2"
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan "$1" "/mnt/ramdisk/output/$2" > /dev/null 2>&1
}

export -f tess

find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}

Recommended answer

You need GNU Parallel. Here I process 500 TIF files of 3 kB each in 37 s on an iMac. By way of comparison, the same processing takes 160 s if done in a sequential for loop.
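As a rough back-of-the-envelope check, the timings above can be scaled to the 135,000 files in the question. This is only a sketch: it assumes throughput stays the same as in the 500-file test, which may not hold on different hardware, with different file sizes, or with other language data. The function names `eta_seconds` and `files_per_minute` are made up for this illustration.

```shell
#!/bin/bash
# eta_seconds TOTAL_FILES BATCH_FILES BATCH_SECONDS
# Scales a measured batch time up to the full file count (integer arithmetic).
eta_seconds() { echo $(( $1 * $3 / $2 )); }

# files_per_minute BATCH_FILES BATCH_SECONDS (rounded down)
files_per_minute() { echo $(( $1 * 60 / $2 )); }

echo "parallel:   $(eta_seconds 135000 500 37) s, $(files_per_minute 500 37) files/min"
echo "sequential: $(eta_seconds 135000 500 160) s, $(files_per_minute 500 160) files/min"
```

Under those assumptions the parallel run works out to 9990 s (a little under 3 hours) at about 810 files per minute, against 43200 s (12 hours) at about 187 files per minute for the sequential loop.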

The basic command looks like this:

parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif

This shows a progress bar and uses all available cores on your machine.

If you want to see what it would do without actually doing anything, use parallel --dry-run.
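To make the replacement strings used in this thread easier to read, here is a small sketch that imitates them with plain shell parameter expansion. It does not call GNU Parallel at all, the function name `show_replacements` and the sample path are invented for illustration, and the imitation of `{.}` is only approximate for paths that have no extension.

```shell
#!/bin/bash
# Imitates four GNU Parallel replacement strings for a single path.
show_replacements() {
    local path=$1
    local base=${path##*/}            # strip the directory part
    echo "{}   -> $path"              # the input as given
    echo "{.}  -> ${path%.*}"         # input without its extension
    echo "{/}  -> $base"              # basename
    echo "{/.} -> ${base%.*}"         # basename without its extension
}

show_replacements ./scans/page001.tif
```

So `tesseract {} {.}` reads `./scans/page001.tif` and, because tesseract appends `.txt` to the output base, writes `./scans/page001.txt` next to it.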

As you have 135,000 files, the file list will probably overflow your command-line length limit. You can check the limit with sysctl like this:

sysctl -a kern.argmax
kern.argmax: 262144
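The kern.argmax name is the macOS/BSD spelling. On Linux the same limit can be read with the POSIX `getconf ARG_MAX`. The sketch below compares that limit with a rough estimate of how many bytes the matching file names would take. The helper name `names_bytes` is made up for this example, and the estimate ignores the space taken by environment variables, so treat the result as approximate.

```shell
#!/bin/bash
# Per-exec limit on argument + environment bytes, as reported by getconf.
arg_max=$(getconf ARG_MAX)

# Approximate bytes the .tif names in directory $1 would occupy on a
# command line: find prints each path followed by one NUL byte.
names_bytes() {
    find "$1" -maxdepth 1 -iname '*.tif' -print0 | wc -c | tr -d ' '
}

needed=$(names_bytes .)
if [ "$needed" -ge "$arg_max" ]; then
    echo "too long for one command line ($needed >= $arg_max bytes): pipe the names on stdin"
else
    echo "fits ($needed < $arg_max bytes)"
fi
```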

So you need to pump the filenames into GNU Parallel on its stdin, separated by null characters so that you don't get problems with spaces or other unusual characters in the names:

find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
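The effect of null separation is easy to see with a file name that contains a space. The sketch below uses `xargs` as a stand-in rather than GNU Parallel, because `xargs` splits its input on whitespace by default and so shows the problem directly. The temporary directory and the file name are invented for the demonstration.

```shell
#!/bin/bash
demo=$(mktemp -d)
touch "$demo/my scan.tif"

# Whitespace-separated input: the single name is split into two arguments.
split_count=$(find "$demo" -iname '*.tif' -print | xargs -n1 echo | wc -l | tr -d ' ')

# NUL-separated input: the name arrives intact as one argument.
nul_count=$(find "$demo" -iname '*.tif' -print0 | xargs -0 -n1 echo | wc -l | tr -d ' ')

echo "whitespace-split arguments: $split_count, NUL-separated arguments: $nul_count"
rm -rf "$demo"
```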


If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file to a subdirectory called processed after processing it, so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF, like this:

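#!/bin/bash

# $1 = input TIF, $2 = the same path without its extension
doit() {
   # Skip files whose text output already exists
   if [ -f "${2}.txt" ]; then
      echo "Skipping $1..."
      return
   fi
   tesseract "$1" "$2" > /dev/null 2>&1
}

export -f doit
time parallel --bar doit {} {.} ::: *.tif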

If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
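The other option mentioned above, moving each TIF into a processed subdirectory once it is done, might look like the sketch below. The function name `doit_mv` and the `OCR` variable are assumptions introduced here so that the OCR command can be swapped out. The file is moved only when the command reports success, so failed files are retried on the next run.

```shell
#!/bin/bash
# OCR is an illustrative variable, not part of the original answer;
# it defaults to tesseract.
OCR=${OCR:-tesseract}

# $1 = input TIF, $2 = output base name (tesseract appends .txt)
doit_mv() {
    mkdir -p processed
    # Move the TIF out of the way only if OCR succeeded.
    if "$OCR" "$1" "$2" > /dev/null 2>&1; then
        mv -- "$1" processed/
    fi
}

# To run it over every file in the current directory, as in the answer:
#   export -f doit_mv; export OCR
#   parallel --bar doit_mv {} {.} ::: *.tif
```

Because `*.tif` only matches files in the current directory, anything already moved into processed is skipped on a restart. With a recursive `find .` the processed directory would have to be excluded explicitly.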

If you have millions of files, you could consider using multiple machines in parallel. Make sure you have ssh logins to each of the machines on your network, and then run across 4 machines, including the localhost, like this:

parallel -S :,remote1,remote2,remote3 ...

where : is shorthand for the machine on which you are running.
