Tesseract OCR on AWS Lambda via virtualenv

Question

I have spent all week attempting this, so this is a bit of a Hail Mary.

I am attempting to package Tesseract OCR into an AWS Lambda function running on Python (I am also using Pillow for image pre-processing, hence the choice of Python).

I understand how to deploy Python packages to AWS Lambda using virtualenv, but I cannot find a way to deploy the actual Tesseract OCR binary into the environment (e.g. /env/).

  • Running pip install py-tesseract successfully deploys the Python wrapper into /env/, but the wrapper relies on a separate (local) install of Tesseract.
  • Running pip install tesseract-ocr gets only so far before it fails. I assume the cause is a missing leptonica dependency, but I have no idea how to package leptonica into /env/ (if that is even possible). The error is:
tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"

  • Downloading the 0.9.1 python-tesseract egg file from https://bitbucket.org/3togo/python-tesseract/downloads and running easy_install on it also fails while resolving dependencies:
      Processing dependencies for python-tesseract==0.9.1
      Searching for python-tesseract==0.9.1
      Reading https://pypi.python.org/simple/python-tesseract/
      Couldn't find index page for 'python-tesseract' (maybe misspelled?)
      Scanning index of all packages (this may take a while)
      Reading https://pypi.python.org/simple/
      No local packages or download links found for python-tesseract==0.9.1
      

Any pointers would be greatly appreciated.

Recommended answer

These attempts fail because the Python packages are only wrappers around Tesseract. You have to compile Tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
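
Since the wrappers depend on a native Tesseract install, a quick check for the executable can tell you whether an environment has one before you debug the Python side. The following is a minimal sketch, not part of the original answer; the function name find_tesseract is made up for illustration.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the path of the first matching executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_tesseract() or "tesseract binary not found on PATH")
```

Inside a virtualenv that only contains the pip-installed wrapper, this is expected to report that the binary is missing, which matches the situation described in the question.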

1) Launch an EC2 instance running 64-bit Amazon Linux.

2) Install the dependencies:

      sudo yum install gcc gcc-c++ make
      sudo yum install autoconf automake
      sudo yum install libtool
      sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
      

3) Compile and install leptonica:

      cd ~
      mkdir leptonica
      cd leptonica
      wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
      tar -zxvf leptonica-1.73.tar.gz
      cd leptonica-1.73
      ./configure
      make
      sudo make install
      

4) Compile and install tesseract:

      cd ~
      mkdir tesseract
      cd tesseract
      wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
      tar -zxvf 3.04.01.tar.gz
      cd tesseract-3.04.01
      ./autogen.sh
      ./configure
      make
      sudo make install
      

5) Download the trained language data into tessdata:

      cd /usr/local/share/tessdata
      # The directory was created by "sudo make install", so writing to it needs sudo
      sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
      export TESSDATA_PREFIX=/usr/local/share/
      

At this point you should be able to use tesseract on this EC2 instance. To use the tesseract binaries in a Lambda function, you need to copy some files from this instance into the zip file you upload to Lambda. The commands below produce a zip file with all the files you need.

6) Zip everything needed to run tesseract on Lambda:

      cd ~
      mkdir tesseract-lambda
      cd tesseract-lambda
      cp /usr/local/bin/tesseract .
      mkdir lib
      cd lib
      cp /usr/local/lib/libtesseract.so.3 .
      cp /usr/local/lib/liblept.so.5 .
      cp /usr/lib64/libpng12.so.0 .
      cd ..
      
      mkdir tessdata
      cd tessdata
      cp /usr/local/share/tessdata/eng.traineddata .
      cd ..
      
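      # Zip the directory contents (not the directory itself) so that
      # tesseract, lib/ and tessdata/ sit at the root of the archive
      zip -r ../tesseract-lambda.zip .
      cd ..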
      

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.
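
Because the handler looks for the binary, lib/ and tessdata/ next to main.py, it can help to check the archive layout before uploading. The sketch below is an illustration rather than part of the original answer: the helper name missing_entries is hypothetical, and the list of entries assumes the files copied in step 6 plus main.py at the archive root.

```python
import zipfile

# Entries the handler in this answer expects at the root of the archive.
REQUIRED = (
    "main.py",
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
)


def missing_entries(zip_file, required=REQUIRED):
    """Return the required entries that are absent from the archive."""
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())
    return [entry for entry in required if entry not in names]
```

Calling missing_entries("tesseract-lambda.zip") on the build machine returns an empty list when the layout is as expected. Entries that show up under an extra top-level directory are reported as missing, which is the most common packaging mistake.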

7) Create a file named main.py, write a Lambda function like the one below, and add it to the root of tesseract-lambda.zip:

      import os
      import subprocess
      from urllib.parse import unquote_plus  # Python 3 replacement for urllib.unquote_plus
      
      import boto3
      
      # Directory that holds main.py, the tesseract binary, lib/ and tessdata/
      SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
      LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
      
      s3 = boto3.client('s3')
      
      def lambda_handler(event, context):
      
          # Get the bucket and object key from the S3 event
          record = event['Records'][0]['s3']
          bucket = record['bucket']['name']
          key = unquote_plus(record['object']['key'])
      
          try:
              print("Bucket: " + bucket)
              print("Key: " + key)
      
              imgfilepath = '/tmp/image.png'
              # tesseract appends ".txt" to the output base name on its own
              outputbase = '/tmp/result'
              textfilepath = outputbase + '.txt'
              exportfile = key + '.txt'
      
              print("Export: " + exportfile)
      
              s3.download_file(bucket, key, imgfilepath)
      
              # Point the loader at the bundled libraries and tesseract at tessdata/
              command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
                  LIB_DIR,
                  SCRIPT_DIR,
                  SCRIPT_DIR,
                  imgfilepath,
                  outputbase,
              )
      
              try:
                  # tesseract reports problems on stderr, so capture it as well
                  output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
                  print(output)
                  s3.upload_file(textfilepath, bucket, exportfile)
              except subprocess.CalledProcessError as e:
                  print(e.output)
      
          except Exception as e:
              print(e)
              print('Error processing object {} from bucket {}.'.format(key, bucket))
              raise
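
The handler builds a shell string to set LD_LIBRARY_PATH and TESSDATA_PREFIX. An alternative, sketched here as an assumption rather than something from the original answer, is to pass the environment and the argument list to subprocess directly, which avoids quoting problems if a path ever contains spaces. The helper name build_tesseract_call is hypothetical; output_base is the output name without extension, since tesseract appends ".txt" itself.

```python
import os


def build_tesseract_call(script_dir, image_path, output_base):
    """Return (args, env) suitable for subprocess.check_output without a shell."""
    env = dict(os.environ)
    # Same two variables the shell command in the handler sets
    env["LD_LIBRARY_PATH"] = os.path.join(script_dir, "lib")
    env["TESSDATA_PREFIX"] = script_dir
    args = [os.path.join(script_dir, "tesseract"), image_path, output_base]
    return args, env
```

In the handler this would be used as subprocess.check_output(args, env=env, stderr=subprocess.STDOUT) in place of the shell=True call.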
      

When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
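
Before wiring up a real S3 trigger, the event-parsing part of the handler can be exercised locally. The snippet below is a minimal sketch with a hypothetical, trimmed-down event; the bucket name, the key and the helper name bucket_and_key are made up for illustration, and a real S3 event carries many more fields.

```python
from urllib.parse import unquote_plus


def bucket_and_key(event):
    """Extract the bucket name and the URL-decoded object key from an S3 event."""
    s3_info = event["Records"][0]["s3"]
    return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


# Hypothetical event, reduced to the fields the handler reads
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "scans/page+1.png"},
            }
        }
    ]
}

print(bucket_and_key(sample_event))  # ('example-bucket', 'scans/page 1.png')
```

S3 URL-encodes object keys in event notifications, which is why the key is decoded before it is passed to download_file.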

Important

The AWS Lambda environment changes from time to time. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it may be a different one by the time you read this answer). If tesseract starts to return a segmentation fault, run "ldd tesseract" inside the Lambda function and check the output to see which libraries it needs (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
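
The ldd check can be scripted so that the Lambda logs show exactly which libraries fail to resolve. The following is a sketch under the assumption that ldd prints its usual "name => path (address)" lines and marks unresolved libraries with "not found"; the helper name missing_libraries and the sample output are illustrative only.

```python
def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


# Illustrative ldd output, not captured from a real Lambda run
sample = """\
\tlinux-vdso.so.1 =>  (0x00007ffd0a5f2000)
\tlibtesseract.so.3 => /var/task/lib/libtesseract.so.3 (0x00007f1c2a000000)
\tliblept.so.5 => /var/task/lib/liblept.so.5 (0x00007f1c29c00000)
\tlibpng12.so.0 => not found
"""

print(missing_libraries(sample))  # ['libpng12.so.0']
```

Inside the function, the input could come from running ldd on the bundled binary with subprocess, with LD_LIBRARY_PATH set to the lib directory so that the bundled libraries are taken into account.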

Thanks for the comment, SergioArcos.
