Combine two commands using GNU parallel for an OCR project

Question

I would like to write a script that OCRs PDFs and deletes the intermediate images once the text files have been written.

The two commands I want to combine are the following.

This command creates a folder for each PDF, extracts the pages as PGM images, and puts them into that folder:

time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'

This command does the OCR and deletes the resulting images (PGM):

time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

I would like to combine both commands so that the script deletes the PGM images after each OCR. If I run the commands above as they are, the first one extracts the images for every PDF and eats up my disk space; only then does the second command do the OCR and delete the images as a last step.

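So:

  1. Create the folder
  2. Extract the PGM images from the PDF
  3. OCR the PGM images to txt
  4. Delete the PGM images that were just used (this is the missing part)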

Basically, I would like these 4 steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?

My first attempt to solve this was the following command:

time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

However, tesseract could not find the language package.

Recommended answer

Updated answer

I have not tested this, so please run it on a copy of a small subset of your files. Once you are happy that it looks good, you can remove the messages that start with DEBUG:.

#!/bin/bash

# Declare a function for "parallel" to call
doit() {
    # Get name of PDF with and without extension
    withext="$1"
    noext="$2"
    echo "DEBUG: Processing $withext into $noext"

    # Make output directory
    mkdir -p "$noext"

    # Extract as PGM into subdirectory
    gs ... -o "$noext/${noext}-%03d.pgm" "$withext"

    # Go to target directory or give up with an error message
    # (return rather than exit, so a direct call does not close your shell)
    cd "$noext" || { echo "ERROR: Failed to cd to $noext" >&2; return 1; }

    # OCR and remove each PGM 
    n=0
    for f in *pgm; do
       echo "DEBUG: OCR $f into $n"
       tesseract "$f" "$n" -l deu_frak
       echo "DEBUG: Remove $f"
       rm "$f"
       ((n=n+1))
    done 
}

# Ensure the function is exported to subshells
export -f doit

find . -name \*.pdf -print0 | parallel -0 doit {} {.}
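The two arguments handed to doit come from GNU parallel's replacement strings: `{}` is the input line and `{.}` is the input line without its extension. GNU parallel also offers `{/.}`, the basename without extension. The sketch below only approximates `{.}` and `{/.}` with bash parameter expansion so you can see what doit receives; the helper names `strip_ext` and `base_noext` are made up for illustration and are not part of parallel.

```shell
#!/bin/bash
# Rough pure-bash equivalents of two GNU parallel replacement strings,
# for an input line such as ./scans/book.pdf (hypothetical path).

strip_ext() {        # approximates {.}: drop the last extension, keep the directory
    printf '%s\n' "${1%.*}"
}

base_noext() {       # approximates {/.}: drop the directory and the last extension
    local b="${1##*/}"
    printf '%s\n' "${b%.*}"
}

strip_ext ./scans/book.pdf     # what doit sees as $2
base_noext ./scans/book.pdf    # a name without the directory part
```

Because `{.}` keeps the directory, an output pattern built as `{.}/{.}-%03d.pgm` repeats the directory part for PDFs found in subfolders (for example `./scans/book/./scans/book-%03d.pgm`). If your PDFs are nested, building the file name from `{/.}` may be the safer choice.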

You should be able to test the doit() function without parallel by running:

doit someFile.pdf someFile
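If you want to check the order of the steps before involving Ghostscript or Tesseract at all, a dry run with stand-ins can help. The following is a minimal sketch, not the answer's script: `fake_extract`, `fake_ocr` and `process_one` are hypothetical names, the stand-ins only create empty files, and, unlike doit, this variant does not `cd` and names each text file after its image.

```shell
#!/bin/bash
# Stand-ins (assumptions) for the real tools, so the flow can be tried
# without gs or tesseract installed.
fake_extract() {            # pretends to render two pages of PDF "$1" into directory "$2"
    local outdir="$2" name
    name=$(basename "$outdir")
    : > "$outdir/$name-001.pgm"
    : > "$outdir/$name-002.pgm"
}

fake_ocr() {                # pretends to OCR image "$1" into "$2.txt"
    echo "text from $1" > "$2.txt"
}

# Per-PDF pipeline: make folder, extract, then OCR and delete image by image
process_one() {
    local pdf="$1" noext="$2" f
    mkdir -p "$noext" || return 1
    fake_extract "$pdf" "$noext" || return 1
    for f in "$noext"/*.pgm; do
        [ -e "$f" ] || continue               # glob matched nothing
        fake_ocr "$f" "${f%.pgm}" && rm -- "$f"   # delete only after OCR succeeded
    done
}

# Try it in a scratch directory
workdir=$(mktemp -d)
: > "$workdir/sample.pdf"
process_one "$workdir/sample.pdf" "$workdir/sample"
ls "$workdir/sample"
```

After the run, the scratch folder should contain only the two .txt files, which is the per-PDF behaviour the question asks for. Swapping the stand-ins for the real gs and tesseract calls is left to you.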

Original answer

If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash function and then call that.

It looks like this:

# Declare a function for "parallel" to call
doit() {
    echo "$1" "$2"
    # mkdir something
    # extract PGM
    # do OCR
    # delete PGM
}

# Ensure the function is exported to subshells
export -f doit

find some files -print0 | parallel -0 doit {} {.}
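The `export -f doit` line matters because the function has to be visible in the new bash processes that run each job. A small sketch of that effect, using a plain `bash -c` child instead of parallel (the function name `greet` is just an illustration, and this assumes the script itself runs under bash):

```shell
#!/bin/bash
# A function defined in this shell...
greet() {
    echo "hello $1"
}

# ...is only visible to child bash processes once it has been exported
export -f greet

# The child shell can now call it by name
bash -c 'greet world'
```

Without the `export -f` line, the child shell would report that `greet` is not found, which is the same failure you would likely see from parallel if doit were not exported.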
