Optimized OCR black/white pixel algorithm

Problem description

I am writing a simple OCR solution for a finite set of characters. That is, I know exactly how all 26 letters of the alphabet will look. I am using C# and am able to easily determine whether a given pixel should be treated as black or white.

I am generating a matrix of black/white pixels for every single character. So for example, the letter I (capital i), might look like the following:

01110
00100
00100
00100
01110

Note: all points, which I use later in this post, assume that the top left pixel is (0, 0), bottom right pixel is (4, 4). 1's represent black pixels, and 0's represent white pixels.

I would create a corresponding matrix in C# like this:

CreateLetter("I", new List<List<bool>>() {
  new List<bool>() { false, true,  true, true,  false },
  new List<bool>() { false, false, true, false, false },
  new List<bool>() { false, false, true, false, false },
  new List<bool>() { false, false, true, false, false },
  new List<bool>() { false, true,  true, true,  false }
});

I know I could probably optimize this part by using a multi-dimensional array instead, but let's ignore that for now, this is for illustrative purposes. Every letter is exactly the same dimensions, 10px by 11px (10px by 11px is the actual dimensions of a character in my real program. I simplified this to 5px by 5px in this posting since it is much easier to "draw" the letters using 0's and 1's on a smaller image).

Now when I give it a 10px by 11px part of an image to analyze with OCR, it would need to run on every single letter (26) on every single pixel (10 * 11 = 110) which would mean 2,860 (26 * 110) iterations (in the worst case) for every single character.
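The brute-force match described above can be sketched as follows. The method name and the use of `bool[,]` matrices (instead of `List<List<bool>>`) are my own assumptions for illustration:

```csharp
using System.Collections.Generic;

// Compare the scanned cell against every letter template, pixel by
// pixel. Worst case: 26 letters * (width * height) pixel tests.
static string MatchBruteForce(bool[,] cell, Dictionary<string, bool[,]> templates)
{
    int h = cell.GetLength(0), w = cell.GetLength(1);
    foreach (var kv in templates)
    {
        bool match = true;
        for (int y = 0; y < h && match; y++)
            for (int x = 0; x < w && match; x++)
                if (kv.Value[y, x] != cell[y, x])
                    match = false;               // mismatch: try the next letter
        if (match) return kv.Key;
    }
    return null;                                 // no template matched
}
```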

I was thinking this could be optimized by defining the unique characteristics of every character. So, for example, let's assume that the set of characters only consists of 5 distinct letters: I, A, O, B, and L. These might look like the following:

01110  00100  00100  01100  01000
00100  01010  01010  01010  01000
00100  01110  01010  01100  01000
00100  01010  01010  01010  01000
01110  01010  00100  01100  01110

After analyzing the unique characteristics of every character, I can significantly reduce the number of tests that need to be performed to test for a character. For example, for the "I" character, I could define its unique characteristic as having a black pixel at coordinate (3, 0), since no other character has that pixel as black. So instead of testing 110 pixels for a match on the "I" character, I reduce it to a 1-pixel test.

This is what it might look like for all these characters:

var LetterI = new OcrLetter() {
  Name = "I",
  BlackPixels = new List<Point>() { new Point(3, 0) }
};
var LetterA = new OcrLetter() {
  Name = "A",
  WhitePixels = new List<Point>() { new Point(2, 4) }
};
var LetterO = new OcrLetter() {
  Name = "O",
  BlackPixels = new List<Point>() { new Point(3, 2) },
  WhitePixels = new List<Point>() { new Point(2, 2) }
};
var LetterB = new OcrLetter() {
  Name = "B",
  BlackPixels = new List<Point>() { new Point(3, 1) },
  WhitePixels = new List<Point>() { new Point(3, 2) }
};
var LetterL = new OcrLetter() {
  Name = "L",
  BlackPixels = new List<Point>() { new Point(1, 1), new Point(3, 4) },
  WhitePixels = new List<Point>() { new Point(2, 2) }
};

This is challenging to do manually for 5 characters and gets much harder as more letters are added. You also want to guarantee that you have the minimum set of unique characteristics for a letter, since you want it to be as optimized as possible.

I want to create an algorithm that will identify the unique characteristics of all the letters and would generate similar code to that above. I would then use this optimized black/white matrix to identify characters.

How do I take the 26 letters that have all their black/white pixels filled in (e.g. the CreateLetter code block) and convert them to an optimized set of unique characteristics that define a letter (e.g. the new OcrLetter() code block)? And how would I guarantee that this set of unique characteristics is as small as possible (e.g. instead of defining 6 points as the unique characteristics, there might be a way to do it with 1 or 2 points, as was possible for the letter "I" in my example)?

An alternative solution I've come up with is using a hash table, which will reduce it from 2,860 iterations to 110 iterations, a 26-fold reduction. This is how it might work:

I would populate it with data similar to the following:

Letters["01110 00100 00100 00100 01110"] = "I";
Letters["00100 01010 01110 01010 01010"] = "A";
Letters["00100 01010 01010 01010 00100"] = "O";
Letters["01100 01010 01100 01010 01100"] = "B";

Now when I reach a location in the image to process, I convert it to a string such as "01110 00100 00100 00100 01110" and simply look it up in the hash table. This solution seems very simple; however, it still requires 110 iterations to generate this string for each letter.
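The key-building step could look something like this (`ToKey` is a name I made up; a `bool[,]` matrix is assumed for the scanned cell):

```csharp
using System.Collections.Generic;
using System.Text;

// Build the "01110 00100 ..." key once per scanned cell (w * h
// iterations); a single dictionary lookup then replaces the
// per-letter pixel comparisons.
static string ToKey(bool[,] cell)
{
    var sb = new StringBuilder();
    for (int y = 0; y < cell.GetLength(0); y++)
    {
        if (y > 0) sb.Append(' ');               // space between rows
        for (int x = 0; x < cell.GetLength(1); x++)
            sb.Append(cell[y, x] ? '1' : '0');
    }
    return sb.ToString();
}

// Usage:
// var letters = new Dictionary<string, string> {
//     ["01110 00100 00100 00100 01110"] = "I",
// };
// letters.TryGetValue(ToKey(cell), out var name);
```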

In big O notation, the algorithm is the same since O(110N) = O(2860N) = O(N) for N letters to process on the page. However, it is still improved by a constant factor of 26, a significant improvement (e.g. instead of it taking 26 minutes, it would take 1 minute).

Update: Most of the solutions provided so far have not addressed the issue of identifying the unique characteristics of a character and rather provide alternative solutions. I am still looking for this solution which, as far as I can tell, is the only way to achieve the fastest OCR processing.

I just came up with a partial solution:

For each pixel in the grid, store the letters that have it as a black pixel.

Using these letters:

  I      A      O      B      L
01110  00100  00100  01100  01000
00100  01010  01010  01010  01000
00100  01110  01010  01100  01000
00100  01010  01010  01010  01000
01110  01010  00100  01100  01110

You would also have something like this:

CreatePixel(new Point(0, 0), new List<Char>() {                         });
CreatePixel(new Point(1, 0), new List<Char>() { 'I',           'B', 'L' });
CreatePixel(new Point(2, 0), new List<Char>() { 'I', 'A', 'O', 'B'      });
CreatePixel(new Point(3, 0), new List<Char>() { 'I'                     });
CreatePixel(new Point(4, 0), new List<Char>() {                         });
CreatePixel(new Point(0, 1), new List<Char>() {                         });
CreatePixel(new Point(1, 1), new List<Char>() {      'A',      'B', 'L' });
CreatePixel(new Point(2, 1), new List<Char>() { 'I'                     });
CreatePixel(new Point(3, 1), new List<Char>() {      'A', 'O', 'B'      });
// ...
CreatePixel(new Point(2, 2), new List<Char>() { 'I', 'A',      'B'      });
CreatePixel(new Point(3, 2), new List<Char>() {      'A', 'O'           });
// ...
CreatePixel(new Point(2, 4), new List<Char>() { 'I',      'O', 'B', 'L' });
CreatePixel(new Point(3, 4), new List<Char>() { 'I', 'A',           'L' });
CreatePixel(new Point(4, 4), new List<Char>() {                         });

Now for every letter, in order to find its unique characteristics, you need to look at which buckets it belongs to, as well as the number of other characters in each bucket. So let's take the example of "I". We go to all the buckets it belongs to (1,0; 2,0; 3,0; ...; 3,4) and see that the one with the fewest other characters is (3,0). In fact, it contains only 1 character, meaning it must be an "I", and we have found our unique characteristic.

You can also do the same for pixels that would be white. Notice that bucket (2,0) contains all the letters except for "L", this means that it could be used as a white pixel test. Similarly, (2,4) doesn't contain an 'A'.

Buckets that either contain all the letters or none of the letters can be discarded immediately, since these pixels can't help define a unique characteristic (e.g. 1,1; 4,0; 0,1; 4,4).
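Building the buckets could be sketched as follows (the method name, tuple keys, and `bool[,]` matrices are my own assumptions):

```csharp
using System.Collections.Generic;

// For every pixel position, collect the letters that are black there.
// Buckets containing all letters or none are dropped, as above, since
// they cannot distinguish anything.
static Dictionary<(int x, int y), HashSet<string>> BuildBuckets(
    Dictionary<string, bool[,]> letters, int w, int h)
{
    var buckets = new Dictionary<(int x, int y), HashSet<string>>();
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
        {
            var set = new HashSet<string>();
            foreach (var kv in letters)
                if (kv.Value[y, x]) set.Add(kv.Key);
            if (set.Count > 0 && set.Count < letters.Count)
                buckets[(x, y)] = set;           // a potentially useful test
        }
    return buckets;
}
```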

It gets trickier when you don't have a 1 pixel test for a letter, for example in the case of 'O' and 'B'. Let's walk through the test for 'O'...

It is contained in the following buckets:

// Bucket   Count   Letters
// 2,0      4       I, A, O, B
// 3,1      3          A, O, B
// 3,2      2          A, O
// 2,4      4       I,    O, B, L

Additionally, we also have a few white pixel tests that can help: (I only listed those that are missing at most 2). The Missing Count was calculated as (5 - Bucket.Count).

// Bucket   Missing Count   Missing Letters
// 1,0      2                  A, O
// 1,1      2               I,    O
// 2,2      2                     O,    L
// 3,4      2                     O, B

So now we can take the shortest black pixel bucket (3,2) and see that when we test for (3,2) we know it is either an 'A' or an 'O'. So we need an easy way to tell the difference between an 'A' and an 'O'. We could either look for a black pixel bucket that contains 'O' but not 'A' (e.g. 2,4) or a white pixel bucket that contains an 'O' but not an 'A' (e.g. 1,1). Either of these could be used in combination with the (3,2) pixel to uniquely identify the letter 'O' with only 2 tests.

This seems like a simple algorithm when there are 5 characters, but how would I do this when there are 26 letters and many more pixels overlapping? For example, let's say that after the (3,2) pixel test, it found 10 different characters that contain that pixel (and this was the fewest of all the buckets). Now I need to find differences from 9 other characters instead of only 1. How would I achieve my goal of using the fewest checks possible, and ensure that I am not running extraneous tests?
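One way to automate the pick-the-smallest-bucket-then-disambiguate process described above is a greedy sketch: for each target letter, repeatedly choose the pixel whose value in the target rules out the most remaining candidates. This is an assumption of mine, not guaranteed minimal (finding a provably smallest test set is a set-cover-style problem), and the method name and `bool[,]` matrices are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;

// Greedy signature search: start with every other letter as a
// candidate; at each step pick the pixel that eliminates the most
// candidates, record it as a (black or white) test, and repeat until
// the target is uniquely identified. Not guaranteed minimal.
static List<(int x, int y, bool black)> GreedySignature(
    string target, Dictionary<string, bool[,]> letters, int w, int h)
{
    var tests = new List<(int x, int y, bool black)>();
    var candidates = new HashSet<string>(letters.Keys);
    candidates.Remove(target);

    while (candidates.Count > 0)
    {
        (int x, int y) best = (-1, -1);
        int bestEliminated = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                bool v = letters[target][y, x];
                // candidates that disagree with the target at this pixel
                int eliminated = candidates.Count(c => letters[c][y, x] != v);
                if (eliminated > bestEliminated) { bestEliminated = eliminated; best = (x, y); }
            }
        if (bestEliminated == 0) break;          // identical letters: cannot separate
        bool targetValue = letters[target][best.y, best.x];
        tests.Add((best.x, best.y, targetValue));
        candidates.RemoveWhere(c => letters[c][best.y, best.x] != targetValue);
    }
    return tests;
}
```

On the 5-letter example above, this should reproduce results like the single (3, 0) black-pixel test for "I" and a 2-test signature for "O".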

Answer

I don't have an answer, but here are some bounds on your eventual solution:

If you want a straight-up "use X pixels as a key" approach, then you'll need at least ceiling(log2(number of characters)) pixels. You won't be able to disambiguate letters with fewer bits. In your case, trying to find 5 pixels is equivalent to finding 5 pixels that split the letters into independent partitions. It probably isn't that easy.

You can also use Moron's (heheh) suggestion and build a tree based on the letter frequencies of the language you are scanning, similar to Huffman coding. That would take up more than 5 bits per letter, but would probably be smaller assuming a power-law distribution of letter usage. I would go with this approach, as it allows you to search for a specific partition at each node rather than searching for a set of partitions.
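A tree of single-pixel tests could be sketched like this (ignoring letter frequencies for simplicity and splitting as evenly as possible instead; the `Node` type and `Build` method are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Node
{
    public string Letter;                    // set only on leaves
    public (int x, int y) Pixel;             // pixel tested at this node
    public Node IfBlack, IfWhite;
}

// At each node, pick the pixel whose black/white split of the remaining
// letters is closest to 50/50, then recurse on both halves. One pixel
// test per level identifies a letter in roughly log2(n) tests.
static Node Build(List<string> names, Dictionary<string, bool[,]> letters, int w, int h)
{
    if (names.Count == 1) return new Node { Letter = names[0] };

    (int x, int y) best = (0, 0);
    int bestImbalance = int.MaxValue;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
        {
            int black = names.Count(n => letters[n][y, x]);
            if (black == 0 || black == names.Count) continue; // useless split
            int imbalance = Math.Abs(names.Count - 2 * black);
            if (imbalance < bestImbalance) { bestImbalance = imbalance; best = (x, y); }
        }
    if (bestImbalance == int.MaxValue)           // identical matrices: give up
        return new Node { Letter = names[0] };

    return new Node
    {
        Pixel = best,
        IfBlack = Build(names.Where(n => letters[n][best.y, best.x]).ToList(), letters, w, h),
        IfWhite = Build(names.Where(n => !letters[n][best.y, best.x]).ToList(), letters, w, h),
    };
}
```

To bias toward frequent letters as the answer suggests, the split criterion would weight each letter by its frequency instead of counting letters equally.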
