Tesseract OCR Text Position


Problem description

I am working on OCR using Tesseract. I have the application working and producing output. I am trying to extract data from an invoice, and the extraction itself works, but the spacing between words in the output file needs to be similar to the spacing in the input. I currently get each word along with its coordinates, and I need to export the words to a text file positioned according to those coordinates.

Code sample:

            using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default))
            {
                engine.DefaultPageSegMode = PageSegMode.AutoOsd;

                // Have to load Pix via a bitmap since Pix doesn't support loading a stream.
                using (var image = new System.Drawing.Bitmap(imageFile.PostedFile.InputStream))
                {
                    // Note: the resized bitmap is never used; the original image is passed to ToPix below.
                    Bitmap bmp = Resize(image, 1920, 1080);

                    using (var pix = PixConverter.ToPix(image))
                    {
                        using (var page = engine.Process(pix))
                        {
                            using (var iter = page.GetIterator())
                            {
                                iter.Begin();
                                do
                                {
                                    Rect symbolBounds;
                                    string path = Server.MapPath("~/Output/data.txt");
                                    if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out symbolBounds))
                                    {
                                        // 'symbolBounds' holds the location of the current word,
                                        // 'curText' holds the recognized text itself.
                                        var curText = iter.GetText(PageIteratorLevel.Word);

                                        //WriteToTextFile(curText, symbolBounds, path);

                                        // Words are appended back to back, so the original spacing is lost here.
                                        resultText.InnerText += curText;
                                    }
                                } while (iter.Next(PageIteratorLevel.Word));
                            }

                            meanConfidenceLabel.InnerText = String.Format("{0:P}", page.GetMeanConfidence());
                        }
                    }
                }
            }

Here is an example of input and output showing the wrong spacing.

Recommended answer

You can loop through the items found on the page using page.GetIterator(). For each item you can get a bounding box, which is a Tesseract.Rect (a rectangle struct) containing the X1, Y1, X2, and Y2 coordinates.

Tesseract.PageIteratorLevel myLevel = /*TODO*/;
using (var page = Engine.Process(img))
using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        if (iter.TryGetBoundingBox(myLevel, out var rect))
        {
            var curText = iter.GetText(myLevel);
            // Your code here: 'rect' contains the location of the text, 'curText' contains the text itself
        }
    } while (iter.Next(myLevel));
}


There is no clear-cut way to use the positions in the input to space the text in the output. You will have to write some custom logic for that.
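One possible first step for such custom logic is to group the word boxes into text lines before worrying about horizontal spacing. The following is only a sketch under stated assumptions: `GroupIntoLines` is a hypothetical helper (not part of the Tesseract wrapper), each word is represented as a plain tuple whose fields mirror X1, Y1, X2, Y2 of the bounding box, and two words are treated as being on the same line when their vertical centers differ by no more than half the height of the first word placed on that line. Skewed scans or mixed font sizes would likely need a more tolerant rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups word boxes into lines by comparing vertical centers.
static List<List<(string Text, int X1, int Y1, int X2, int Y2)>> GroupIntoLines(
    IEnumerable<(string Text, int X1, int Y1, int X2, int Y2)> input)
{
    var result = new List<List<(string Text, int X1, int Y1, int X2, int Y2)>>();

    // Visit words from top to bottom (ordered by vertical center).
    foreach (var word in input.OrderBy(x => x.Y1 + x.Y2))
    {
        double center = (word.Y1 + word.Y2) / 2.0;

        if (result.Count > 0)
        {
            var current = result[result.Count - 1];
            var anchor = current[0]; // first word placed on the current line
            double anchorCenter = (anchor.Y1 + anchor.Y2) / 2.0;
            double tolerance = (anchor.Y2 - anchor.Y1) / 2.0; // assumption: half the word height

            if (Math.Abs(center - anchorCenter) <= tolerance)
            {
                current.Add(word);
                continue;
            }
        }

        // Too far below the current line: start a new one.
        result.Add(new List<(string Text, int X1, int Y1, int X2, int Y2)> { word });
    }

    // Within each line, order words from left to right.
    foreach (var line in result)
    {
        line.Sort((a, b) => a.X1.CompareTo(b.X1));
    }

    return result;
}

// Made-up sample boxes, deliberately out of reading order.
var sampleWords = new List<(string Text, int X1, int Y1, int X2, int Y2)>
{
    ("Invoice", 10, 10, 80, 30),
    ("Total", 300, 52, 350, 70),
    ("No.", 90, 12, 120, 31),
    ("42", 400, 50, 420, 70),
};

var lines = GroupIntoLines(sampleWords);
foreach (var line in lines)
{
    Console.WriteLine(string.Join(" ", line.Select(w => w.Text)));
}
// Prints:
// Invoice No.
// Total 42
```

In the iterator loop, the tuples could be filled from `curText` and the fields of `rect`. Each resulting line can then be written to the text file on its own row.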

You might be able to estimate the number of spaces you need to the left of your text with something like this:

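// Cast to double first: rect.X1 is an int, so integer division by an int width would truncate to 0.
var padLeftSpaces = (int)Math.Round((double)rect.X1 / inputWidth * outputWidthSpaces);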
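To show how that estimate might be applied to a whole line, here is a minimal sketch. `RenderLine` is a hypothetical helper, and it assumes the words of one line are already sorted left to right, that `inputWidth` is the image width in pixels, that `outputWidthSpaces` is the desired line width in characters, and that the output is viewed in a monospaced font. Each word is placed at the column proportional to its X1, but never closer than one space to the previous word.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: renders one line so each word starts near the column
// proportional to its X1 coordinate.
static string RenderLine(
    IEnumerable<(string Text, int X1)> lineWords, int inputWidth, int outputWidthSpaces)
{
    var sb = new StringBuilder();

    foreach (var word in lineWords) // assumed sorted left to right
    {
        // Floating-point division, so small X1 values do not truncate to 0.
        int column = (int)Math.Round((double)word.X1 / inputWidth * outputWidthSpaces);

        // Keep at least one space between neighbours so words never merge.
        int target = sb.Length == 0 ? column : Math.Max(column, sb.Length + 1);

        sb.Append(' ', target - sb.Length);
        sb.Append(word.Text);
    }

    return sb.ToString();
}

// Made-up sample: a 1920-pixel-wide image mapped onto an 80-character line.
var sampleLine = new (string Text, int X1)[]
{
    ("Invoice", 0),
    ("No.", 480),
    ("42", 1440),
};

string rendered = RenderLine(sampleLine, 1920, 80);
Console.WriteLine(rendered);
// "Invoice" starts at column 0, "No." at column 20, "42" at column 60.
```

Because character columns are much coarser than pixels, the result is only an approximation of the original layout; a wider `outputWidthSpaces` gives finer placement at the cost of longer lines.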
