Forcing Tesseract to match pattern (four digits in a row)


Problem description

I'm trying to get Tesseract (using the Tess4J wrapper) to match only a specific pattern. The pattern is four digits in a row, which I think would be \d\d\d\d. Here is a very small subset of the image I'm feeding Tesseract (the floorplans are restricted, so I'm cautious about posting much more of it): http://mike724.com/view/a06771

I'm using the following Java code:

    File imageFile = new File("/<redacted>/file.pdf");

    Tesseract instance = Tesseract.getInstance();
    instance.setTessVariable("load_system_dawg", "F");
    instance.setTessVariable("load_freq_dawg", "F");
    instance.setTessVariable("user_words_suffix", "");
    instance.setTessVariable("user_patterns_suffix", "\\d\\d\\d\\d");

    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }

The problem I'm running into is that Tesseract does not seem to honor these configuration options: I still get text/words in the results. I expect to get only the room numbers (e.g. 2950).

Recommended answer

You have not configured this correctly.

user_patterns_suffix is meant to indicate the file extension of a text file that contains your patterns, not the pattern itself, e.g.

user_patterns_suffix pats

would mean you need to put a file in the Tesseract tessdata folder

tessdata/eng.pats

... assuming eng is the language you are using.
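As a minimal sketch of the file layout described above, the following writes a patterns file named eng.pats containing the single pattern \d\d\d\d. The class name, the helper method, and the temporary directory are hypothetical and only for illustration; in practice the directory would be your real tessdata folder, and user_patterns_suffix would then be set to pats rather than to the pattern. Whether Tess4J's setTessVariable applies this variable early enough for the file to be loaded is not confirmed here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class PatternFileSetup {

    // Writes <tessdata>/<lang>.<suffix> with one pattern per line
    // and returns the path of the file that was written.
    static Path writePatternFile(Path tessdata, String lang, String suffix,
                                 List<String> patterns) throws IOException {
        Files.createDirectories(tessdata);
        Path file = tessdata.resolve(lang + "." + suffix);
        Files.write(file, patterns, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary directory stands in for the real tessdata folder.
        Path tessdata = Files.createTempDirectory("tessdata");
        // "\\d\\d\\d\\d" in Java source is the four-digit pattern \d\d\d\d on disk.
        Path file = writePatternFile(tessdata, "eng", "pats",
                Arrays.asList("\\d\\d\\d\\d"));
        System.out.println("Wrote " + file);
    }
}
```

The suffix passed here must match the value given to user_patterns_suffix, and the language prefix must match the language Tesseract is run with.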

See more information here:

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data

I do recall that user patterns may not be any shorter than 6 fixed chars before a pattern, so you may not be able to accomplish this in any case, but try the correct config first.
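If the pattern file turns out not to be usable for a pattern this short, one possible fallback (not something the answer proposes, and not a Tesseract feature) is to filter the OCR output afterwards with a regular expression. The sketch below assumes the OCR result is already available as a String; the class name, method name, and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoomNumberFilter {

    // Exactly four digits, not embedded in a longer run of word characters.
    private static final Pattern FOUR_DIGITS = Pattern.compile("\\b\\d{4}\\b");

    // Returns every standalone four-digit token found in the OCR text, in order.
    static List<String> extractRoomNumbers(String ocrText) {
        List<String> matches = new ArrayList<>();
        Matcher m = FOUR_DIGITS.matcher(ocrText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical OCR output; a real result would come from doOCR(...).
        String sample = "OFFICE 2950\nSTAIR 3 12345\nLAB 3001";
        System.out.println(extractRoomNumbers(sample)); // prints [2950, 3001]
    }
}
```

This only discards unwanted text after recognition; unlike a pattern file, it does not influence how Tesseract recognizes the digits.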
