What is the typical method to separate connected letters in a word using OCR

Question

I am very new to OCR and know almost nothing about the algorithms used to recognize words. I am just getting familiar with the field.

Could anybody please advise on the typical method used to recognize and separate individual characters in connected form (I mean a word where all the letters are linked together)? Forget about handwriting: supposing the letters are joined together in a known font, what is the best method to determine each individual character in a word? When characters are written separately there is no problem, but when they are joined together we need to know where every single character starts and ends before going on to the next step and matching each one to a letter. Is there any known algorithm for that?

Recommended answer

The standard term for this process is "character segmentation" - segmentation is the image processing term for breaking images into grouped areas for recognition. Searching for "Arabic character segmentation" throws up a lot of hits in Google Scholar if you want to learn more.
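To make "segmentation" concrete, here is a minimal sketch of one simple baseline: cutting a word where the vertical projection profile (the count of ink pixels per column) becomes very thin, as it tends to do on the stroke joining two letters. This is an illustration only, not a method named in this answer. The names `column_profile`, `cut_columns`, and `max_ink`, and the tiny binary image, are all assumptions made for the example.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def cut_columns(image, max_ink=1):
    """Return one candidate cut per run of thin, non-empty columns.

    A column is "thin" when it holds between 1 and max_ink ink pixels,
    which is where a joining stroke between two letters often sits.
    The cut is placed at the middle column of each run.
    """
    profile = column_profile(image)
    cuts = []
    run_start = None
    # The sentinel value is never thin, so it closes a run at the right edge.
    for x, ink in enumerate(profile + [max_ink + 1]):
        thin = 0 < ink <= max_ink
        if thin and run_start is None:
            run_start = x
        elif not thin and run_start is not None:
            cuts.append((run_start + x - 1) // 2)
            run_start = None
    return cuts


# Hypothetical word: two box-shaped "letters" joined by a one-pixel baseline.
WORD = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(column_profile(WORD))  # ink count per column
print(cut_columns(WORD))     # candidate cut positions
```

On real scans a fixed threshold like this over- or under-segments easily (thin strokes inside a letter look just like joins), which is one reason the approaches below avoid committing to letter boundaries up front.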

I would encourage you to look at Tesseract, an open-source OCR implementation, and especially at its docs.

Feature as defined in the glossary has a bit on this, but there is a ton of information here.

Basically Tesseract solves the problem (from How Tesseract Works) by looking at blobs (not letters) then combining those blobs into words. This avoids the problem you describe, while creating new problems.
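As a rough illustration of the "blobs first, words second" idea, the sketch below labels 8-connected ink regions in a binary image and then groups neighbouring blobs into words by horizontal gap. It is not Tesseract's actual code or API. `find_blobs`, `group_into_words`, `max_gap`, and the sample image are hypothetical names and data chosen for the example.

```python
from collections import deque


def find_blobs(image):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink regions."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


def group_into_words(boxes, max_gap=1):
    """Group boxes, left to right, into words when the empty gap is <= max_gap columns."""
    words = []
    for box in sorted(boxes):
        if words:
            word_right = max(b[2] for b in words[-1])
            if box[0] - word_right - 1 <= max_gap:
                words[-1].append(box)
                continue
        words.append([box])
    return words


# Hypothetical line of text: two close blobs, then a distant third one.
PAGE = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
]

blobs = find_blobs(PAGE)
print(blobs)
print(group_into_words(blobs))
```

Note that a fully connected word comes out as a single blob here, so nothing is cut into letters at this stage. That is the point of the approach, and also where its "new problems" start.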

For Arabic (as you point out) Tesseract doesn't work. I don't know much about this area, but this paper (http://www.ccis2k.org/iajit/index.php?option=com_content&task=view&id=373&Itemid=293) seems to imply that Dynamic Time Warping (DTW) is a useful technique. DTW tries to stretch the words to match them against known words, and again works in word space rather than letter space.
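For a feel of what "stretching words to match known words" means, here is a minimal sketch of the classic dynamic-programming DTW distance, applied to per-column ink counts of a word image. It is not the method from the linked paper. The feature choice, the names `dtw_distance` and `best_match`, and the toy lexicon are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences (absolute difference as local cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against the same b element
                cost[i][j - 1],      # b[j-1] repeats against the same a element
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def best_match(profile, lexicon):
    """Return the lexicon key whose reference profile has the lowest DTW cost."""
    return min(lexicon, key=lambda word: dtw_distance(profile, lexicon[word]))


# Hypothetical reference profiles (ink pixels per column) for two known words.
LEXICON = {
    "word_a": [3, 1, 3],
    "word_b": [1, 3, 1],
}

# The same shape as "word_a", but rendered wider (each part stretched).
query = [3, 3, 1, 1, 3]

print(dtw_distance(query, LEXICON["word_a"]))
print(best_match(query, LEXICON))
```

Because the warping absorbs differences in width, the stretched query aligns with the "word_a" reference at no cost, without ever deciding where one letter ends and the next begins.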
