Tesseract OCR force pattern


Problem description

I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I tried the bazaar pattern-matching configuration in Tesseract with the pattern \d\d\d\A\A, but the OCR still recognizes other words that don't match.

I also tried the "tessedit_char_whitelist" parameter, but it does not let me choose the position of each character.

  • I run the command "tesseract image.jpg result -l eng bazaar" and get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

  • image.jpg:
    • The result:

    AB123
    ABC12
    A1234
    12345
    ABCD1
    

So the result is wrong: I only wanted to capture the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I strictly followed the bazaar tutorial.

Recommended answer

Try using this pattern with quantifiers instead:

    [a-zA-Z]{2}\d{3}
    

This should match only 2 alphabetic characters followed by 3 digits.

The reason you were matching everything before is that \w is alphanumeric.
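The source does not confirm whether Tesseract's user-patterns file accepts this quantifier syntax. As a minimal sketch of one alternative, the same regular expression can be applied to the OCR output as a post-processing step in Python. The names `PATTERN`, `ocr_lines`, and `matches` are hypothetical, and the five strings are copied from the result shown in the question rather than read from a real result file.

```python
import re

# Pattern from the answer: exactly two ASCII letters followed by three digits.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")

# The five lines Tesseract produced in the question (hard-coded for illustration).
ocr_lines = ["AB123", "ABC12", "A1234", "12345", "ABCD1"]

# fullmatch() requires the whole line to fit the pattern, not just a substring.
matches = [line for line in ocr_lines if PATTERN.fullmatch(line)]

print(matches)
```

Of the five strings, only "AB123" fits the pattern in full, so it is the only one kept. Filtering after recognition does not change what Tesseract recognizes; it only discards lines that do not have the expected shape.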
