Perl Image::OCR::Tesseract module on Windows

Question

Does anyone know of a graceful way to install the "Image::OCR::Tesseract" module on Windows? The module fails to install via CPAN on Windows because of a *NIX-only dependency called "LEOCHARRE::CLI". That dependency does not seem to be required to run "Image::OCR::Tesseract" itself.

I've managed to get the module working by first manually installing the dependency modules listed in Makefile.PL (except for "LEOCHARRE::CLI") and then moving the module file into the correct directory structure under "C:\Perl\site\lib\Image\OCR". The final step was to alter the section of code that calls the ImageMagick and Tesseract executables from the command line so that the program names are wrapped in quotes when the module invokes them.

This works, but on a production system I'd feel much better doing a PPM or CPAN install from a repository that works on Windows.

Recommended answer

Never mind, I got it, though I can't decide which is the better solution.

Getting the installer to work on Windows via the traditional "perl Makefile.PL, make, make test, make install" routine requires editing the Makefile.PL script, adding the missing Windows install module (Devel::AssertOS::MSWin32), and patching AssertEXE.pm to use "File::Which" rather than the built-in shell "which" command that Windows lacks. Even after all this, "Image::OCR::Tesseract" still has to be patched to put quotes around the program names when it executes "convert" and "tesseract" from the command line.
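The reason a portable lookup is needed can be illustrated with a small sketch. The helper below, find_in_path, is a hypothetical name and is not the code in AssertEXE.pm or File::Which; it only approximates the kind of PATH search that "File::Which" performs, using core modules so it runs without extra installs. The PATHEXT fallback list is an assumption.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of AssertEXE.pm or File::Which:
# search a list of directories (default: the PATH) for a program.
sub find_in_path {
    my ($name, @dirs) = @_;
    @dirs = File::Spec->path unless @dirs;

    my $is_windows = $^O =~ /MSWin/;
    # On Windows, programs carry an extension such as .exe; PATHEXT lists them.
    # The fallback list here is an assumption for the sketch.
    my @exts = ('');
    push @exts, split /;/, ($ENV{PATHEXT} || '.exe;.bat;.cmd') if $is_windows;

    for my $dir (@dirs) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile($dir, $name . $ext);
            next unless -f $candidate;
            # The executable bit is only meaningful outside Windows.
            return $candidate if $is_windows || -x $candidate;
        }
    }
    return;
}

# Report whether the two programs the module shells out to can be found.
for my $program (qw(tesseract convert)) {
    my $path = find_in_path($program);
    print "$program: ", (defined $path ? $path : 'not found on PATH'), "\n";
}
```

Because nothing here shells out to "which", the same check works on Windows and on *NIX systems.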

Given the number of steps involved in making the installer work on Windows, and the fact that the module does not build a binary component to link against, I'd say the best option for installing the Tesseract module and getting it working on Windows is to first install the following binary packages:

ImageMagick

Tesseract: http://code.google.com/p/tesseract-ocr/downloads/list

Next, locate your Perl module directory; on my system it is "C:\Perl\site\lib". Create a folder called "Image" there if you don't already have one. Open the Image folder and create a folder called "OCR", then open the OCR folder. At this point, your path should be something along the lines of "C:\Perl\site\lib\Image\OCR". Create a new text file called "Tesseract.pm" and copy in the following content:

package Image::OCR::Tesseract;
use strict;
use Carp;
use Cwd;
use String::ShellQuote 'shell_quote';
use Exporter;
use vars qw(@EXPORT_OK @ISA $VERSION $DEBUG $WHICH_TESSERACT $WHICH_CONVERT %EXPORT_TAGS @TRASH);
@ISA = qw(Exporter);
@EXPORT_OK = qw(get_ocr get_hocr _tesseract convert_8bpp_tif tesseract);
$VERSION = sprintf "%d.%02d", q$Revision: 1.24 $ =~ /(\d+)/g;
%EXPORT_TAGS = ( all => \@EXPORT_OK );


BEGIN {
   use File::Which 'which';
   $WHICH_TESSERACT = which('tesseract');
   $WHICH_CONVERT   = which('convert');

   # check before quoting: a quoted empty path ('""') is true and would hide a missing binary
   $WHICH_TESSERACT or die("Is tesseract installed? Cannot find bin path to tesseract.");
   $WHICH_CONVERT or die("Is convert installed? Cannot find bin path to convert.");

   # on Windows, quote the paths so install locations containing spaces still work
   if($^O=~m/MSWin/) {
      $WHICH_TESSERACT='"'.$WHICH_TESSERACT.'"';
      $WHICH_CONVERT='"'.$WHICH_CONVERT.'"';
   }
}

END {
   scalar @TRASH or return;
   if ( $DEBUG ){
      print STDERR "Debug on, these are trash files:\n".join("\n",@TRASH) ;
   }
   else {
      unlink @TRASH;
   }
}

sub DEBUG { Carp::cluck("Image::OCR::Tesseract::DEBUG() deprecated") }

sub get_hocr {
   my ($abs_image,$abs_tmp_dir,$lang)= @_;
   -f $abs_image or croak("$abs_image is not a file on disk");
   my $hocr="hocr";
   if(defined $abs_tmp_dir){

      -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");

      $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
      my $abs_copy = "$abs_tmp_dir/$1";

      # TODO, what if source and dest are same, i want it to die
      require File::Copy;
      File::Copy::copy($abs_image, $abs_copy) 
         or die("cant make copy of $abs_image to $abs_copy, $!");

      # change the image to get ocr from to be the copy
      $abs_image = $abs_copy;
      # since it's a copy. erase that on exit
      push @TRASH, $abs_image;      
   }

   my $tmp_tif = convert_8bpp_tif($abs_image);
   
   push @TRASH, $tmp_tif; # for later delete

   _tesseract($tmp_tif,$lang,$hocr) || '';
}

sub get_ocr {
   my ($abs_image,$abs_tmp_dir,$lang)= @_;
   -f $abs_image or croak("$abs_image is not a file on disk");
   if(defined $abs_tmp_dir){

      -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");

      $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
      my $abs_copy = "$abs_tmp_dir/$1";

      # TODO, what if source and dest are same, i want it to die
      require File::Copy;
      File::Copy::copy($abs_image, $abs_copy) 
         or die("cant make copy of $abs_image to $abs_copy, $!");

      # change the image to get ocr from to be the copy
      $abs_image = $abs_copy;
      # since it's a copy. erase that on exit
      push @TRASH, $abs_image;      
   }

   my $tmp_tif = convert_8bpp_tif($abs_image);
   
   push @TRASH, $tmp_tif; # for later delete

   _tesseract($tmp_tif,$lang) || '';
}

sub convert_8bpp_tif {
   my ($abs_img,$abs_out) = (shift,shift);
   defined $abs_img or die('missing image arg');

   $abs_out ||= $abs_img.'.tmp.'.time().(int rand(9000)).'.tif';
   
   my @arg = ( $WHICH_CONVERT, $abs_img, '-compress','none','+matte', $abs_out );
   
   #die (join(" ", @arg));
   
   system(@arg) == 0 or die("convert $abs_img error.. $?");

   $DEBUG and warn("made $abs_out 8bpp tiff.");
   $abs_out;
}



# people expect tesseract to automatically convert

*tesseract = \&_tesseract;
sub _tesseract {
    my ($abs_image,$lang,$hocr) = @_;
   defined $abs_image or croak('missing image path arg');
   
   $abs_image=~/\.tif+$/i or warn("Are you sure '$abs_image' is a tif image? This operation may fail.");
   
   #my @arg = (
   #   $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image), 
   #   (defined $lang and ('-l', $lang) ), '2>/dev/null'
   #); 

   my $cmd = 
      ( sprintf '%s %s %s', 
         $WHICH_TESSERACT, 
         shell_quote($abs_image), 
         shell_quote($abs_image) 
      ) .
      ( defined $lang ? " -l $lang" : '' ) .
      ( defined $hocr ? " hocr" : '' ) .
      "  2>/dev/null";
   $DEBUG and warn "command: $cmd";

    system($cmd); # hard to check ==0 

    my $txt = $abs_image.($hocr?".html":".txt");
   unless( -f $txt ){      
        Carp::cluck("no text output for image '$abs_image'. (No text file '$txt' found on disk)");
      return;
   }

    $DEBUG and warn "Found text file '$txt'";
   
   my $content = (_slurp($txt) || '');   
   $DEBUG and warn("content length of text in '$txt' from image '$abs_image' is ". length $content );
   push @TRASH, $txt;

   $content;
}

sub _slurp {
   my $abs = shift;
   open(FILE,'<', $abs) or die("can't open file for reading '$abs', $!");
   local $/;
   my $txt = <FILE>;
   close FILE;
   $txt;
}  

1;


__END__

#sub _force_imgtype {
#   my $img = shift;
#   my $type = shift;
#   my $delete_original = shift;
#   $delete_original ||=0;
#   
#
#   if($img=~/\.$type$/i){
#      return $img;
#   }
#
#   my $img_out= $img;
#   $img_out=~s/\.\w{1,5}$/\.$type/ or die("cant get file ext for $img");
#
#
#
#}

Save and close the file. If you have a command-line session that was opened before you installed the ImageMagick and Tesseract binaries, close it and open a new one so that it picks up the updated PATH. Test the module with the following script:

use Image::OCR::Tesseract;
my $image = 'SomeImageFileThatContainsText.jpg';

my $text = Image::OCR::Tesseract::get_ocr($image);

print "Text...\n";
print $text."\n";

print "Normal Exit\n";

exit;
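If the test script cannot locate Image/OCR/Tesseract.pm, the module file may have ended up outside the library directories your Perl searches; "C:\Perl\site\lib" is only where it lives on my system. The following is a minimal sketch, assuming your Perl's own site library setting is the right target. The helper name ocr_module_dir is hypothetical and not part of any module.

```perl
use strict;
use warnings;
use Config;
use File::Path 'make_path';
use File::Spec;

# Hypothetical helper: return the Image/OCR folder under a library root,
# creating it if needed. Defaults to this Perl's site library.
sub ocr_module_dir {
    my ($lib_root) = @_;
    $lib_root = $Config{sitelib} unless defined $lib_root;
    my $dir = File::Spec->catdir($lib_root, 'Image', 'OCR');
    make_path($dir) unless -d $dir;
    return $dir;
}

# Only print where Tesseract.pm is expected; nothing is created here.
print File::Spec->catfile($Config{sitelib}, 'Image', 'OCR', 'Tesseract.pm'), "\n";
```

Calling ocr_module_dir() with no argument would create the folders under the site library, which may require administrator rights; passing a scratch directory lets you try it safely first.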

That's it. Messy, I know, but there's no good way around the fact that the module's installer really needs to be updated to support Windows (and other) systems, even though the actual module code runs almost without modification. In fact, if Tesseract and ImageMagick were installed to paths without spaces, the "Image::OCR::Tesseract" module code would not need any changes at all; this minor tweak simply lets the supporting executables be installed anywhere, including their default locations.
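The quoting tweak can be looked at in isolation. This is a minimal sketch, not code from the module: quote_program is a hypothetical helper, and the install path in the example call is an assumed location used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a program path in double quotes on Windows so
# that a path containing spaces is passed to the shell as one word.
sub quote_program {
    my ($path, $os) = @_;
    $os = $^O unless defined $os;
    # Nothing found: return nothing so the caller can report the missing program.
    return unless defined $path && length $path;
    return $path unless $os =~ /MSWin/;   # other systems: leave the path alone
    return $path if $path =~ /^".*"$/;    # already quoted
    return qq{"$path"};
}

# Assumed example path; your installation directory may differ.
print quote_program('C:\Program Files\Tesseract-OCR\tesseract.exe', 'MSWin32'), "\n";
```

Returning nothing for an undefined path matters: quoting an empty result would produce the string `""`, which Perl treats as true, so a later "is it installed?" check would pass by mistake.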
