Tesseract user-patterns

Question

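Does anyone know how to use user patterns (user_patterns_suffix) in Tesseract? Could you advise me how to set them up and how to test that they are working? I tried to follow the Tesseract guide (Tesseract user-patterns), but I did not see it affect the result at all.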

Thanks.

Solution

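Tesseract treats a pattern as a limited kind of "regular expression". Patterns are useful if, say, you are scanning a book whose data all follows the same format. A pattern tells Tesseract what formats to expect, much as the user-words file tells it what words to expect. Below is how Tesseract describes the use of patterns: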

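Each pattern can contain any non-whitespace characters; however, only patterns that contain characters from the unicharset of the corresponding language will be useful.

The only metacharacter is \. To be used in a pattern as an ordinary character it should be escaped with \ (e.g. the string C:\Documents should be written in the patterns file as C:\\Documents).

This function supports a very limited regular expression syntax. One can express a character, a character class, and the number of times the entity should be repeated in the pattern.

To denote a character class, use one of:

  • \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  • \d - unichar for which UNICHARSET::get_isdigit() is true
  • \n - unichar for which UNICHARSET::get_isdigit() or UNICHARSET::isalpha() is true
  • \p - unichar for which UNICHARSET::get_ispunct() is true
  • \a - unichar for which UNICHARSET::get_islower() is true
  • \A - unichar for which UNICHARSET::get_isupper() is true

\* can be specified after a character or pattern to indicate that the character/pattern can be repeated any number of times before the next character/pattern occurs.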
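To make the syntax above easier to reason about, here is a minimal Python sketch that translates a pattern line into an ordinary `re` regular expression. It is an illustration only, not how Tesseract processes patterns internally. The names `CLASS_MAP` and `pattern_to_regex` are made up for this example, the ASCII classes only approximate what the language's unicharset would report, and reading `\*` as "one or more" is an assumption.

```python
import re
import string

# Approximate Python equivalents of the character classes listed above.
# Assumption: Tesseract consults the language's unicharset; these ASCII
# classes are only a rough stand-in for illustration.
CLASS_MAP = {
    "c": "[A-Za-z]",                                # alphabetic
    "d": "[0-9]",                                   # digit
    "n": "[A-Za-z0-9]",                             # digit or alphabetic
    "p": "[" + re.escape(string.punctuation) + "]",  # punctuation
    "a": "[a-z]",                                   # lowercase
    "A": "[A-Z]",                                   # uppercase
}


def pattern_to_regex(pattern):
    """Translate one user-pattern line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            # Ordinary (concrete) character: match it literally.
            parts.append(re.escape(ch))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        nxt = pattern[i + 1]
        if nxt == "*":
            if not parts:
                raise ValueError("\\* has nothing to repeat")
            # Assumption: the preceding item occurs one or more times.
            parts[-1] += "+"
        elif nxt == "\\":
            # Escaped backslash: a literal '\' character.
            parts.append(re.escape("\\"))
        elif nxt in CLASS_MAP:
            parts.append(CLASS_MAP[nxt])
        else:
            raise ValueError("unknown escape: \\" + nxt)
        i += 2
    return re.compile("".join(parts))
```

A candidate string can then be checked with `pattern_to_regex(line).fullmatch(text)`, which is a quick way to sanity-check a patterns file against the strings you expect in your scans before running OCR.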

Examples:

1-8\d\d-GOOG-411 will be expanded to strings: 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.

"ww.\n\*.com" will be expanded to strings like: "ww.a.com" "ww.a123.com" ... "ww.ABCDefgHIJKLMNop.com"

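Note: When choosing which patterns to include, be aware that very generic patterns will make Tesseract run slower. For example, \n\* at the beginning of a pattern will make Tesseract consider all combinations of the proposed character choices for each of the segmentations, which will be unacceptably slow. Because such speed problems can be difficult to identify, each user pattern has to have at least kSaneNumConcreteChars concrete characters from the unicharset at the beginning.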
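As a rough way to try this out, the Python sketch below checks that each pattern starts with some concrete characters, writes the patterns one per line to a file, and builds a command line that passes that file to the `tesseract` executable. It rests on several assumptions: a Tesseract version whose CLI accepts the `--user-patterns PATH` option (older setups instead set `user_patterns_suffix` in a config file and place `<lang>.user-patterns` in the tessdata directory), a `min_concrete` parameter that is only a hypothetical stand-in for kSaneNumConcreteChars (its real value is not given in the text above), and made-up helper names and file names throughout.

```python
from pathlib import Path


def leading_concrete_chars(pattern):
    """Count the ordinary characters before the first class or \\* escape."""
    count = 0
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
            if nxt != "\\":
                break  # a character class, \* or a dangling backslash
            i += 2     # an escaped backslash counts as one concrete char
        else:
            i += 1
        count += 1
    return count


def write_patterns_file(path, patterns, min_concrete=1):
    """Write one pattern per line, rejecting overly generic patterns.

    min_concrete is a hypothetical threshold chosen for this sketch.
    """
    for pattern in patterns:
        if leading_concrete_chars(pattern) < min_concrete:
            raise ValueError("pattern too generic: " + pattern)
    path = Path(path)
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path


def tesseract_command(image, output_base, patterns_file, lang="eng"):
    """Build the argument list; run it with subprocess.run(cmd, check=True)."""
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--user-patterns", str(patterns_file),
    ]
```

To see whether the patterns have any effect, one option is to run the same image twice, once with the command above and once without the `--user-patterns` arguments, and compare the two output files. Whether a difference appears may depend on the Tesseract version and OCR engine in use.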
