Combine two commands using GNU parallel for OCR project
Question
I would like to write a script that runs OCR on PDFs and deletes the intermediate images once the text files have been written.
The two commands I want to combine are the following.
This command creates a folder for each PDF, extracts PGM images from the PDF, and puts them into that folder:
time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'
This command does the OCR and deletes the resulting images (PGM):
time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
I would like to combine both commands so that the script deletes the PGM images after each OCR run. If I run the commands above as they are, the first extracts the images for every PDF and eats up my disk space, and only then does the second command do the OCR and delete the images as a last step.
So:
- create the folder
- extract the PGM images from the PDF
- OCR from PGM to txt
- delete the PGM images that were just used (this is the missing part)
Basically, I would like these 4 steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?
My first attempt to solve this was the following command:
time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
However, tesseract could not find the language package.
Recommended answer
Updated Answer
I have not tested this, so please run it on a copy of a small subset of your files. Once you are happy that it looks good, you can turn off the messages that start with DEBUG:
#!/bin/bash
# Declare a function for "parallel" to call
doit() {
    # Get name of PDF with and without extension
    withext="$1"
    noext="$2"
    echo "DEBUG: Processing $withext into $noext"
    # Make output directory
    mkdir -p "$noext"
    # Extract as PGM into subdirectory; basename keeps the file name free of
    # directory parts so PDFs found in subdirectories also work
    gs ... -o "$noext/$(basename "$noext")-%03d.pgm" "$withext"
    # Go to target directory or die with error message
    cd "$noext" || { echo "ERROR: Failed to cd to $noext"; exit 1; }
    # OCR and remove each PGM
    n=0
    for f in *pgm; do
        echo "DEBUG: OCR $f into $n"
        tesseract "$f" "$n" -l deu_frak
        echo "DEBUG: Remove $f"
        rm "$f"
        ((n=n+1))
    done
}
# Ensure the function is exported to subshells
export -f doit
find . -name \*.pdf -print0 | parallel -0 doit {} {.}
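The OCR loop numbers its output files in the order in which the shell expands *pgm, which is sorted by name. Because the gs output pattern zero-pads page numbers with %03d, that order should match page order. The following is a minimal dry-run sketch of just the numbering logic; preview_numbering is a hypothetical helper that is not part of the answer, and the placeholder files are empty stand-ins for real page images.

```shell
#!/bin/bash
# Hypothetical helper: list how the loop would pair PGM files with output numbers.
# The body is a subshell ( ... ), so the caller's working directory is untouched.
preview_numbering() (
    cd "$1" || exit 1
    n=0
    for f in *pgm; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        echo "$f -> $n.txt"
        n=$((n+1))
    done
)

# Demo with empty placeholder files standing in for real page images.
demo=$(mktemp -d)
touch "$demo/book-001.pgm" "$demo/book-002.pgm" "$demo/book-010.pgm"
preview_numbering "$demo"
rm -r "$demo"
```

Nothing is converted or deleted here; the sketch only prints which text file each page image would map to.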
You should be able to test the doit() function without parallel by running:
doit someFile.pdf someFile
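When testing by hand, the second argument has to match what parallel would substitute for {.}, which is the input path with its extension removed. As a rough sketch, the shell expansion ${path%.*} gives the same result for paths whose file name has an extension, which is always the case for the PDFs found here. show_args is a hypothetical helper used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the call that "parallel doit {} {.}" would make
# for one input path. ${path%.*} strips the last extension, approximating {.}
# for file names that actually have an extension.
show_args() {
    path="$1"
    printf 'doit %s %s\n' "$path" "${path%.*}"
}

show_args "./someFile.pdf"
show_args "./scans/volume1.pdf"
```

The second call shows that for a PDF in a subdirectory the directory part stays in both arguments, which is worth keeping in mind when building output paths from them.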
Original Answer
If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash function and then call that.
It looks like this:
# Declare a function for "parallel" to call
doit() {
    echo "$1" "$2"
    # mkdir something
    # extract PGM
    # do OCR
    # delete PGM
}
# Ensure the function is exported to subshells
export -f doit
find some files -print0 | parallel -0 doit {} {.}
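The export -f line matters because the jobs run in separate shell processes, and a function that is only defined in the current shell is not visible there. The sketch below illustrates this with a plain child bash instead of parallel, assuming the jobs are run by bash; the toy doit only reports its arguments.

```shell
#!/bin/bash
# Toy stand-in for the real function: it only reports its arguments.
doit() {
    echo "with extension: $1, without: $2"
}

# Without this line the child bash below would report that doit is not found.
export -f doit

# A child bash process, standing in for a shell started to run one job.
bash -c 'doit "$1" "$2"' _ scan.pdf scan
```

Commenting out the export -f line and running the script again is a quick way to see the failure that the export prevents.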