Tesseract OCR: Read Horizontally Rather Than Vertically (C#)

Question

We have a C# .NET app that uses Tesseract to perform optical character recognition (OCR) on .tiff files. Here's an example:

We then output the data to a text file. However, Tesseract reads the data vertically. In my example image, it reads the TIFF as two columns of data, and the output from Tesseract looks like this:

TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes

What we want is for Tesseract to read the TIFF file horizontally, so that the output looks like this:

TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes

We've tried the various page segmentation options for Tesseract, but they all produce the same result.

Has anyone run into this same issue? Does anybody have any ideas?

Recommended Answer

I know this is an old post, but I ran into the same problem today.

Setting the segmentation mode with engine.SetVariable("tessedit_pageseg_mode", 6); did not work.

And for some reason I didn't find the setting in the config files either.

Solution:

engine.DefaultPageSegMode = PageSegMode.SingleBlock;
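
For context, here is a minimal sketch of where that line could sit in a complete program. It assumes the Tesseract NuGet package (the .NET wrapper that provides TesseractEngine and PageSegMode); the tessdata folder, the "eng" language, and the input and output file names are hypothetical placeholders, not taken from the original post.

```csharp
using System.IO;
using Tesseract; // assumed: the "Tesseract" NuGet package (.NET wrapper)

class Program
{
    static void Main()
    {
        // Hypothetical paths: adjust to your tessdata folder and files.
        const string tessDataPath = @"./tessdata";
        const string inputPath = @"./input.tiff";
        const string outputPath = @"./output.txt";

        using (var engine = new TesseractEngine(tessDataPath, "eng", EngineMode.Default))
        {
            // Treat the page as a single uniform block of text
            // (the same mode as page segmentation mode 6).
            engine.DefaultPageSegMode = PageSegMode.SingleBlock;

            using (var img = Pix.LoadFromFile(inputPath))
            using (var page = engine.Process(img))
            {
                // Write the recognized text to the output file.
                File.WriteAllText(outputPath, page.GetText());
            }
        }
    }
}
```

With SingleBlock, Tesseract is asked to treat the whole page as one block rather than splitting it into columns, which is likely why each label and its value come out on the same line. The property is set before engine.Process is called so that it applies to that page.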
