Tesseract OCR on AWS Lambda via virtualenv
Question
I have spent all week attempting this, so this is a bit of a hail mary.
I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also using PILLOW for image pre-processing, hence the choice of Python).
I understand how to deploy Python packages onto AWS using virtualenv; however, I cannot seem to find a way of deploying the actual Tesseract OCR into the environment (e.g. /env/).
- Doing
pip install py-tesseract
results in a successful deployment of the Python wrapper into /env/; however, this relies on a separate (local) install of Tesseract.
- Doing
pip install tesseract-ocr
gets me only so far before it errors out as follows, which I assume is due to a missing leptonica dependency. However, I have no idea how to package leptonica into /env/ (if that is even possible):
tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"
- Downloading the 0.9.1 python-tesseract egg file from https://bitbucket.org/3togo/python-tesseract/downloads and running easy_install also errors out when looking for dependencies:
Processing dependencies for python-tesseract==0.9.1
Searching for python-tesseract==0.9.1
Reading https://pypi.python.org/simple/python-tesseract/
Couldn't find index page for 'python-tesseract' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for python-tesseract==0.9.1
Any pointers would be greatly appreciated.
Recommended answer
It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
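The wrapper/binary split can be seen directly from Python: the standard library's shutil.which reports whether a tesseract executable is reachable on PATH. This is only a minimal sketch, and the helper name find_tesseract is made up for illustration.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # This is the situation right after "pip install" of a wrapper only.
        print("tesseract binary not found; a Python wrapper alone cannot run OCR")
    else:
        print("tesseract binary found at " + path)
```

On a machine where only the pip wrapper is installed, this would report that the binary is missing, which is what the steps that follow address.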
1) Launch an EC2 instance with 64-bit Amazon Linux;
2) Install the dependencies:
sudo yum install gcc gcc-c++ make
sudo yum install autoconf aclocal automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
3) Compile and install leptonica:
cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install
4) Compile and install tesseract:
cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install
5) Download the trained language data into tessdata:
cd /usr/local/share/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/
At this point you should be able to use tesseract on this EC2 instance. To copy the binaries of tesseract and use it on a lambda function you will need to copy some files from this instance to the zip file you upload to lambda. I'll post all the commands to get a zip file with all the files you need.
6) Zip everything needed to run tesseract on Lambda:
cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..
mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..
zip -r ../tesseract-lambda.zip .
cd ..
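Because the handler looks for tesseract, lib/ and tessdata/ next to main.py, it can help to confirm where the files actually ended up inside the archive before uploading. The following is a rough sketch using Python's zipfile module; the function name missing_entries and the EXPECTED_ENTRIES tuple are illustrative, and the list assumes the library versions named in this answer.

```python
import sys
import zipfile

# Entries assumed to sit at the root of the zip, next to main.py.
EXPECTED_ENTRIES = (
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
)


def missing_entries(zip_file, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from the archive."""
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())
    return [entry for entry in expected if entry not in names]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_zip.py tesseract-lambda.zip
    print(missing_entries(sys.argv[1]))
```

If every entry is reported missing, the files were probably stored under a top-level folder (for example tesseract-lambda/tesseract) rather than at the root of the zip.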
The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but should work.
7) Create a file named main.py, write a Lambda function like the one below, and add it to the root of tesseract-lambda.zip:
import os
import subprocess
from urllib.parse import unquote_plus

import boto3

# Directory the zip is unpacked into; tesseract, lib/ and tessdata/
# are expected to sit next to this file.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')


def lambda_handler(event, context):
    # Get the bucket and object key from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        print("Bucket: " + bucket)
        print("Key: " + key)
        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base name itself
        outputbase = '/tmp/result'
        txtfilepath = outputbase + '.txt'
        exportfile = key + '.txt'
        print("Export: " + exportfile)
        s3.download_file(bucket, key, imgfilepath)
        # Point the dynamic linker and tesseract at the bundled files
        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )
        try:
            output = subprocess.check_output(command, shell=True)
            print(output)
            s3.upload_file(txtfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            print(e.output)
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e
When creating the AWS Lambda function on the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
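To make it clearer what the handler reads from the incoming event, here is a small sketch that extracts the bucket and key from a hand-written S3 event. It is written for Python 3, where unquote_plus lives in urllib.parse; the helper name bucket_and_key, the bucket name and the object key are all made up, and a real S3 notification carries many more fields.

```python
from urllib.parse import unquote_plus


def bucket_and_key(event):
    """Pull the bucket name and the URL-decoded object key out of an S3 event."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # S3 URL-encodes keys in notifications: "+" is a space, "%28" is "(", etc.
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


# Hypothetical event, trimmed to the fields the handler uses.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "scans/page+one%281%29.png"},
            }
        }
    ]
}

print(bucket_and_key(sample_event))
```

Passing a dictionary like sample_event to lambda_handler is also a simple way to exercise the function locally, provided AWS credentials and the referenced object exist.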
Important
From time to time things change in AWS Lambda's environment. For example, the current image for the lambda env is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to return segmentation fault, run "ldd tesseract" on the Lambda function and see the output for what libs are needed (currently libtesseract.so.3 liblept.so.5 libpng12.so.0).
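The ldd check can be scripted so that it runs inside the function itself. The sketch below assumes ldd is available in the Lambda environment, as the note above implies; missing_libraries and check_binary are illustrative names rather than part of any library.

```python
import os
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libpng12.so.0 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(binary_path, lib_dir):
    """Run ldd on the bundled binary with lib_dir added to LD_LIBRARY_PATH."""
    env = dict(os.environ, LD_LIBRARY_PATH=lib_dir)
    result = subprocess.run(
        ["ldd", binary_path],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return missing_libraries(result.stdout)
```

Calling check_binary(os.path.join(SCRIPT_DIR, 'tesseract'), LIB_DIR) from the handler and printing the result would list any library that still has to be copied into lib/.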
Thanks for the comment, SergioArcos.