Tesseract empty page
Question
I use Tesseract to detect characters in an image.
try
{
    using (var engine = new TesseractEngine(@"C:\Users\ea\Documents\Visual Studio 2015\Projects\ocrtTest", "eng", EngineMode.Default))
    {
        using (var img = Pix.LoadFromFile(testImagePath))
        using (var src = (Bitmap)Image.FromFile(testImagePath)) // source bitmap used for cropping
        {
            using (var page = engine.Process(img))
            {
                var text = page.GetHOCRText(1);
                File.WriteAllText("test.html", text);
                //Console.WriteLine("Text: {0}", text);
                //Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());
                int p = 0; // paragraphs
                int l = 0; // lines
                int w = 0; // words
                int s = 0; // symbols
                int counter = 0; // index used in the cropped image file names
                using (var iter = page.GetIterator())
                {
                    iter.Begin();
                    do // blocks
                    {
                        do // paragraphs
                        {
                            do // lines
                            {
                                do // words
                                {
                                    do // symbols
                                    {
                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                        {
                                            p++; // count paragraphs
                                        }
                                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                        {
                                            l++; // count lines
                                        }
                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                                        {
                                            w++; // count words
                                        }
                                        s++; // count symbols
                                        // Get the bounding box for the symbol and crop it from the source bitmap
                                        Rect symbolBounds;
                                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds))
                                        {
                                            Rectangle dueDateRectangle = new Rectangle(symbolBounds.X1, symbolBounds.Y1, symbolBounds.X2 - symbolBounds.X1, symbolBounds.Y2 - symbolBounds.Y1);
                                            rect = dueDateRectangle; // rect is declared outside this snippet
                                            using (Bitmap cloneBitmap = src.Clone(dueDateRectangle, src.PixelFormat))
                                            {
                                                // Save as PNG with a matching extension (the original wrote PNG data to a .bmp name)
                                                cloneBitmap.Save("character" + counter + ".png", ImageFormat.Png);
                                            }
                                            counter++;
                                        }
                                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                                    // Do any word post-processing here (e.g. group symbols by word)
                                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
                    } while (iter.Next(PageIteratorLevel.Block));
                }
                Console.WriteLine("Paragraphs = " + p);
                Console.WriteLine("Lines = " + l);
                Console.WriteLine("Words = " + w);
                Console.WriteLine("Symbols = " + s);
            }
        }
    }
}
catch (Exception e)
{
    // The original snippet was cut off before its catch block; this minimal handler only reports the error.
    Console.WriteLine(e);
}
This works when the image contains a lot of text, but not when the image contains only a single letter.
It does find a symbol: the output shows Symbols = 1. But it cannot get the bounding box. Why? The same happens when I use an image of the alphabet.
Recommended answer
You may need to test the OCR with different page segmentation modes and OCR engine modes to get the best result. Below is the usage information available in Tesseract 4.0.
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
   bypassing hacks that are Tesseract-specific.
OCR Engine modes:
0 Original Tesseract only.
1 Neural nets LSTM only.
2 Tesseract + LSTM.
3 Default, based on what is available.
For example:
- psm 8 would give the best result for OCR of a single word
- psm 6 may give the best result for a block of text
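For the single-letter image in the question, psm 10 (single character) is the natural mode to try first. The sketch below assumes the .NET Tesseract wrapper used in the question, where the mode can be passed as the second argument to Process and PageSegMode.SingleChar corresponds to psm 10. The tessdata path and the image file name are placeholders, not values from the question.

```csharp
using System;
using Tesseract;

class SingleCharExample
{
    static void Main()
    {
        // "./tessdata" and "letter.png" are placeholder paths for illustration only.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile("letter.png"))
        // PageSegMode.SingleChar is the wrapper's equivalent of psm 10.
        using (var page = engine.Process(img, PageSegMode.SingleChar))
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            Rect bounds;
            if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
            {
                Console.WriteLine("Symbol '{0}' at ({1},{2})-({3},{4})",
                    iter.GetText(PageIteratorLevel.Symbol),
                    bounds.X1, bounds.Y1, bounds.X2, bounds.Y2);
            }
            else
            {
                Console.WriteLine("No bounding box found for the symbol.");
            }
        }
    }
}
```

Whether this mode actually recovers the bounding box depends on the image, so treat it as a starting point for testing rather than a guaranteed fix.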
Your code uses the default engine mode and does not specify a segmentation mode. You may want to run some more tests to find out which modes give the correct result.
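One way to run such tests is to process the same image under several segmentation modes and compare the recognized text and mean confidence. This is a rough sketch, again assuming the .NET Tesseract wrapper from the question; the list of modes, the tessdata path and the image file name are illustrative choices.

```csharp
using System;
using Tesseract;

class ModeComparison
{
    static void Main()
    {
        // Candidate modes to compare (psm 3, 6, 8 and 10); the selection is illustrative.
        var modes = new[]
        {
            PageSegMode.Auto,
            PageSegMode.SingleBlock,
            PageSegMode.SingleWord,
            PageSegMode.SingleChar
        };

        // "./tessdata" and "letter.png" are placeholder paths.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile("letter.png"))
        {
            foreach (var mode in modes)
            {
                // Dispose each page before processing the next one:
                // the engine handles only one page at a time.
                using (var page = engine.Process(img, mode))
                {
                    Console.WriteLine("{0}: text='{1}', confidence={2:F2}",
                        mode, page.GetText().Trim(), page.GetMeanConfidence());
                }
            }
        }
    }
}
```

The same loop could be repeated with a different EngineMode passed to the TesseractEngine constructor to compare the OCR engine modes as well.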