Tesseract OCR on AWS Lambda via virtualenv


Question


I have spent all week attempting this, so this is a bit of a hail mary.

I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also using Pillow for image pre-processing, hence the choice of Python).

I understand how to deploy Python packages onto AWS using virtualenv, however I cannot seem to find a way of deploying the actual Tesseract OCR into the environment (e.g. /env/).

  • Doing pip install py-tesseract results in a successful deployment of the python wrapper into /env/, however this relies on a separate (local) install of Tesseract
  • Doing pip install tesseract-ocr gets me only a certain distance before it errors out as follows which I am assuming is due to a missing leptonica dependency. However, I have no idea how to package up leptonica into /env/ (if that is even possible)

tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"

Processing dependencies for python-tesseract==0.9.1
Searching for python-tesseract==0.9.1
Reading https://pypi.python.org/simple/python-tesseract/
Couldn't find index page for 'python-tesseract' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for python-tesseract==0.9.1

Any pointers would be greatly appreciated.

Solution

It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
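To make the dependency concrete: wrappers of this kind generally rely on a `tesseract` executable or its shared libraries being installed separately (stated here as a general assumption, not a claim about the internals of any one package). A minimal sketch, using only the standard library, that checks whether such a binary is reachable in the current environment; the helper name `tesseract_available` is hypothetical:

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary_name) is not None


if __name__ == "__main__":
    # Prints False whenever no tesseract binary is on PATH,
    # which is the situation the pip-installed wrappers run into.
    print(tesseract_available())
```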

1) Start an EC2 instance with 64-bit Amazon Linux;

2) Install dependencies:

sudo yum install gcc gcc-c++ make
sudo yum install autoconf automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

3) Compile and install leptonica:

cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install

4) Compile and install tesseract

cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install

5) Download language traineddata to tessdata

cd /usr/local/share/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/

At this point you should be able to use tesseract on this EC2 instance. To copy the binaries of tesseract and use it on a lambda function you will need to copy some files from this instance to the zip file you upload to lambda. I'll post all the commands to get a zip file with all the files you need.

6) Zip all the stuff you need to run tesseract on lambda

cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..

mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..

# zip the directory's contents (not the directory itself) so the files sit at the root of the zip
zip -r ../tesseract-lambda.zip .

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but should work.

7) Create a file named main.py, write a Lambda function like the one below and add it to the root of tesseract-lambda.zip:

from __future__ import print_function

from urllib.parse import unquote_plus  # Python 3 location of unquote_plus
import boto3
import os
import subprocess

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')

def lambda_handler(event, context):

    # Get the bucket and object from the event (the key arrives URL-encoded)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        print("Bucket: " + bucket)
        print("Key: " + key)

        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base name it is given
        outputbase = '/tmp/result'
        resultfilepath = outputbase + '.txt'
        exportfile = key + '.txt'

        print("Export: " + exportfile)

        s3.download_file(bucket, key, imgfilepath)

        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )

        try:
            output = subprocess.check_output(command, shell=True)
            print(output)
            s3.upload_file(resultfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            print(e.output)

    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e
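
The object key in an S3 event notification arrives URL-encoded (spaces as `+`, other characters as `%XX`), which is why the handler decodes it before use. A small sketch of that step in isolation, written for Python 3; the helper name `parse_s3_event` and the sample event are illustrative and not part of the handler itself:

```python
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) from the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys are URL-encoded in the notification, so decode before use.
    key = unquote_plus(record["object"]["key"])
    return bucket, key


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "scans/page+one%281%29.png"}}}
    ]
}

if __name__ == "__main__":
    print(parse_s3_event(sample_event))
```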

When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
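
Because the Handler value is resolved relative to the root of the archive, it can help to confirm the layout before uploading. A rough sketch using the standard `zipfile` module; the function name `missing_from_zip_root` and the list of expected entries are assumptions based on the files gathered in steps 6 and 7:

```python
import os
import zipfile

# Entries expected at the archive root (assumed from steps 6 and 7).
EXPECTED = [
    "main.py",
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
]


def missing_from_zip_root(zip_path, expected=EXPECTED):
    """Return the expected entries that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [name for name in expected if name not in names]


if __name__ == "__main__":
    if os.path.exists("tesseract-lambda.zip"):
        print(missing_from_zip_root("tesseract-lambda.zip"))
```

An empty list means every expected file sits where the handler looks for it; entries that show up under a `tesseract-lambda/` prefix instead would be reported as missing.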

IMPORTANT

From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to return a segmentation fault, run "ldd tesseract" inside the Lambda function and check the output to see which libraries are needed (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
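
One way to do that check from inside the function is to run `ldd` through `subprocess` and pick out the libraries the loader cannot resolve. This is a sketch under the assumption that `ldd` is present in the runtime and prints unresolved entries as `name => not found` (the usual glibc format); `find_missing_libs` and `check_binary` are hypothetical helper names:

```python
import os
import subprocess


def find_missing_libs(ldd_output):
    """Return library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path, lib_dir):
    """Run ldd on a binary with lib_dir added to the library search path."""
    result = subprocess.run(
        ["ldd", path],
        env=dict(os.environ, LD_LIBRARY_PATH=lib_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return find_missing_libs(result.stdout)
```

Calling `check_binary` with the bundled `tesseract` path and the `lib` directory from the handler and printing the result would list any shared library that still needs to be copied into `lib/`.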

Thanks for the comment, SergioArcos.
