How to Make Tesseract Faster

Question

This is a long shot, but I have to ask. I need any ideas that might make the Tesseract OCR engine faster. I'm processing 2M PDFs consisting of about 20M pages of text, and I need to get every bit of performance that I can. The current estimate is that this will take about a year to complete if I do nothing.

I've tweaked the input images to get some boosts there, but I need to think about other approaches. I don't think improvements to the images will get me anywhere at this point.

For example:

  • Could Tesseract be recompiled with optimization flags or something similar?
  • Could shared CPU memory or a GPU be put to work?
  • Could I somehow tell Tesseract to use more memory (I have lots)?
  • Are there other ways to make a CPU-bound C++ program faster?
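On the first point, in newer Tesseract releases (4.0+) runtime settings tend to matter as much as compile flags: `--oem 1` restricts Tesseract to the LSTM engine, `--psm` fixes the page segmentation mode, and `-l` limits which language models are loaded. A minimal sketch that only builds such a command line (the file names are placeholders; whether these flags actually help depends on your Tesseract version and images):

```python
from typing import List

def build_tesseract_cmd(image_path: str, out_base: str,
                        lang: str = "eng", oem: int = 1, psm: int = 6) -> List[str]:
    """Build a tesseract command line with speed-oriented settings (Tesseract 4+).

    --oem 1 selects the LSTM engine only, and a fixed --psm skips
    page-layout guessing on every page.
    """
    return [
        "tesseract", image_path, out_base,
        "-l", lang,        # limiting languages avoids loading extra models
        "--oem", str(oem),
        "--psm", str(psm),
    ]

cmd = build_tesseract_cmd("page_0001.png", "page_0001")
# subprocess.run(cmd, check=True)  # uncomment where tesseract is installed
```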

Currently, Tesseract is being run by our task runner, Celery, which uses multiprocessing to do its work. This way, I can keep all of the server's cores busy.
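Celery itself is incidental here; the pattern is simply one OCR task per page, fanned out across worker processes. A minimal stand-in using the stdlib multiprocessing pool, with the OCR call stubbed out (`ocr_page` and the page names are placeholders, not the poster's actual code):

```python
from multiprocessing import Pool

def ocr_page(page_path: str) -> str:
    # Placeholder: a real worker would shell out to tesseract here,
    # e.g. subprocess.run(["tesseract", page_path, out_base], check=True).
    return f"text of {page_path}"

def ocr_all(pages):
    # One worker process per core; each page is an independent CPU-bound
    # job, so throughput scales roughly with core count.
    with Pool() as pool:
        return pool.map(ocr_page, pages)

if __name__ == "__main__":
    texts = ocr_all([f"page_{i:04d}.png" for i in range(4)])
    print(len(texts))  # 4
```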

I (obviously?) don't know what I'm talking about because I'm a Python developer and Tesseract is written in C++, but if there's any way to get a boost here, I'd love ideas.

Answer

I also have huge OCR needs and Tesseract is prohibitively slow. I ended up going for a custom feedforward net similar to this one. You don't have to build it yourself, though; you can use a high-performance library like Nervana neon, which happens to be easy to use.

So the problem has two parts:

1) Separate characters from non-characters.
2) Feed characters to the net.

Let's say you feed characters in batches of size 1000, that you resize each character to dimensions 8 x 8 (64 pixels), and that you want to recognize 26 letters (lowercase AND uppercase) and 10 digits and 10 special characters (72 glyphs total). Then parsing all 1000 characters ends up being two (non-associative!) matrix products:

(AB) dot C.

A would be a 1000 x 64 matrix, B would be a 64 x 256 matrix, C would be a 256 x 72 matrix.

For me, this is several orders of magnitude faster than Tesseract. Just benchmark how fast your computer can do those matrix products (the elements are floats).

The matrix products are non-associative because after the first one you have to apply a (cheap) function called a ReLU.
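Under the assumed dimensions (a 1000 x 64 batch, one 256-unit hidden layer, 72 output glyphs), the whole forward pass is just those two products with a ReLU in between. A NumPy sketch, with random weights standing in for a trained B and C:

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.random((1000, 64))          # batch: 1000 characters, 8 x 8 = 64 pixels each
B = rng.standard_normal((64, 256))  # input -> hidden weights (trained in practice)
C = rng.standard_normal((256, 72))  # hidden -> output weights, 72 glyphs

hidden = np.maximum(A @ B, 0.0)     # ReLU; this is why (AB)C != A(BC) here
scores = hidden @ C                 # one row of 72 glyph scores per character
predictions = scores.argmax(axis=1) # predicted glyph index for each character
print(scores.shape)                 # (1000, 72)
```

Timing just these two products (e.g. with `time.perf_counter`) is the benchmark the answer suggests.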

It took me a few months to get this whole enchilada to work from scratch, but OCR was a major part of my project.

Also, segmenting characters is non-trivial. Depending on your PDFs, it can be anything from an easy exercise in computer vision to an open research problem in artificial intelligence.

I'm not claiming this is the easiest or most effective way to do this... This is simply what I did!
