Tesseract OCR large number of files
Question
I have around 135,000 .TIF files (1.2 KB to 1.4 KB) sitting on my hard drive. I need to extract text from those files. If I run tesseract as a cron job I get 500 to 600 per hour at most. Can anyone suggest strategies so I can get at least 500 per minute?
Update:
Below is my code after implementing the suggestions given by @Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input

tess() {
    # Skip any file whose output already exists, so a restart resumes cleanly
    if [ -f "/mnt/ramdisk/output/$2.txt" ]; then
        echo "skipping $2"
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan "$1" "/mnt/ramdisk/output/$2" > /dev/null 2>&1
}
export -f tess

find . -name '*.tif' -print0 | parallel -0 -j100 --progress tess {/} {/.}
Answer
You need GNU Parallel. Here I process 500 TIF files of 3 kB each in 37 s on an iMac. By way of comparison, the same processing takes 160 s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files, they will probably overflow your command-line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
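(The sysctl key above is macOS-specific. If you are on Linux, a portable way to read the same limit is POSIX getconf, which works on both platforms - a small check, assuming only a standard shell environment:)

```shell
# Print the maximum length, in bytes, of the arguments to exec().
# On macOS this matches `sysctl kern.argmax`; on Linux it is the
# kernel's ARG_MAX limit.
getconf ARG_MAX
```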
So you need to pump the filenames into GNU Parallel on its stdin, separating them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file into a subdirectory called processed once it is done, so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF, like this:
#!/bin/bash
doit() {
    # Skip any TIF whose text output already exists
    if [ -f "${2}.txt" ]; then
        echo "Skipping $1..."
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see that the second run is near-instantaneous, because all the processing was done the first time.
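The other restart strategy mentioned above - moving each TIF into a processed subdirectory once it is done - could be sketched like this. This is a minimal illustration, not the answer's own code: the demo directory, sample file names, and the touch stand-in for the tesseract call are all hypothetical.

```shell
#!/bin/bash
set -e
# Scratch layout for the demo (hypothetical paths)
mkdir -p demo/processed demo/output
touch demo/page1.tif demo/page2.tif      # sample input files

for f in demo/*.tif; do
    base=$(basename "$f" .tif)
    # tesseract "$f" "demo/output/$base" > /dev/null 2>&1   # the real OCR step
    touch "demo/output/$base.txt"        # stand-in for tesseract's output
    mv "$f" demo/processed/              # moved files are skipped on restart
done
```

Because the loop globs only the TIFs still sitting in the input directory, a restarted run simply finds nothing left to do for the files already moved.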
If you have millions of files, you could consider using multiple machines in parallel. Just make sure you have ssh logins to each of the machines on your network, then run across 4 machines, including the localhost, like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.