Tesseract improvements and image pre-processing steps

Question

I am working with the Tesseract library, and below is the input for Tesseract:

In the initial implementation I used only the "MRZ" zone of the ID card, but the actual intention is to scan the entire document and get all the text on the ID card.

I have gone through this document, and the first step to improve the quality of Tesseract's output is that the image should be 300 dpi.

1) How do I convert an image captured by the camera in iOS to 300 dpi?

2) What are the best contrast and brightness levels for Tesseract to give the best output?

3) Is there any other pre-processing step that I can apply to an image to get good accuracy?

4) For better accuracy, what is the recommended image resolution?

5) I have used "int tesseract::TESSDLL_API::MeanTextConf" to get the confidence score. Given a confidence score for each character, can I decide that a recognized character is accurate when its score is above some percentage? If I am wrong, can you please explain the use of the "MeanTextConf" method?

Recommended answer

Some time ago I wrote several generic OCR blog posts on image pre-processing and "how OCR works best". Please find them here: http://www.ocr-it.com/user-scenario-process-digital-camera-pictures-and-ocr-to-extract-specific-numbers

In general, getting a high enough resolution should be the first step. A low-resolution image simply does not have enough information per letter to read characters reliably. Then I do adaptive binarization, where the image is converted to black and white using a threshold; backgrounds should disappear and characters should remain clear, without extra noise or holes in them. Then, optionally, you can segment the image into its various fields and process each field separately with specific settings, such as "digits only" for the number, "M|F" for the sex field, etc.
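
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the exact method used by the answerer: the function names (`pixels_needed`, `adaptive_binarize`, `keep_allowed`) and the `radius` and `offset` defaults are hypothetical, the card is assumed to be ID-1 sized (85.60 mm wide), and local-mean thresholding is just one common form of adaptive binarization. The character filter is a post-processing stand-in for an engine-side setting such as Tesseract's `tessedit_char_whitelist` variable.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale pixel values in the range 0-255


def pixels_needed(physical_inches: float, dpi: int = 300) -> int:
    """Pixels required along one side of the document to reach `dpi`."""
    return round(physical_inches * dpi)


def adaptive_binarize(img: Gray, radius: int = 7, offset: int = 10) -> Gray:
    """Local-mean thresholding (assumed variant of adaptive binarization).

    A pixel becomes black (0) when it is darker than the mean of its
    neighbourhood minus `offset`; otherwise it becomes white (255).
    The neighbourhood is a (2*radius+1) square, clipped at the borders.
    """
    h, w = len(img), len(img[0])
    # Summed-area table with a zero border, so each window sum is O(1).
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    out = []
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(0 if img[y][x] < mean - offset else 255)
        out.append(row)
    return out


def keep_allowed(text: str, allowed: str) -> str:
    """Drop every character a field cannot contain (per-field post-filter)."""
    return "".join(ch for ch in text if ch in allowed)


if __name__ == "__main__":
    # Assumption: an ID-1 card is 85.60 mm wide, i.e. 85.6 / 25.4 inches.
    print(pixels_needed(85.6 / 25.4))           # pixels across the card at 300 dpi
    print(keep_allowed("12O45-6", "0123456789"))  # "digits only" field
    print(keep_allowed("M.", "MF"))               # "M|F" sex field
```

A local threshold is used here because a photographed card is rarely evenly lit: a stroke in a bright area can be lighter than the bare background in a shadowed area, so no single global threshold separates text from background. In practice the window size would be chosen relative to the character height in the image.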
