Detect table with OpenCV


Problem description

I often work with scanned papers. The papers contain tables (similar to Excel tables) that I need to type into the computer manually. To make matters worse, the tables can have different numbers of columns. Manually entering them into Excel is tedious, to say the least.

I thought I could save myself a week of work if I wrote a program to OCR them. Would it be possible to detect the header text areas with OpenCV and then OCR the text inside the detected image coordinates?

Can I achieve this with the help of OpenCV, or do I need an entirely different approach?

The example table is really just a standard table, similar to what you see in Excel and other spreadsheet applications (see below).

Recommended answer

This question seems a little old, but I was working on a similar problem and came up with my own solution, which I explain here.

Getting good accuracy when reading text with any OCR engine involves several challenges, mainly the following cases:

  1. Noise caused by poor image quality or by unwanted elements/blobs in the background region. This requires some preprocessing such as noise removal, which can be done easily with a Gaussian filter or a standard median filter. Both are available in OpenCV.
  2. Wrong orientation of the image: the OCR engine then fails to segment the lines and words in the image correctly, which gives the worst accuracy.

In this case I think the scanned image is simple and of quite good quality, so the following steps can be used to solve the problem.

  1. Simple image binarization will remove the background content, leaving only the necessary content as shown here.
  2. Now we have to remove the lines, which in this case form the table grid. These can be identified using connected components, by removing the large connected components. The final image to feed to the OCR engine will look like this.

For OCR we can use the Tesseract open-source OCR engine. I got the following results from OCR:

标题标题

header! header2 header3

row1cell1 row1cell2 row1cell3

row2cell1 row2cell2 row2cell3

As we can see, the result is quite accurate, but there are some issues such as header!, which should be header1. This is because the OCR engine misread the 1 as !. This problem can be solved by further processing the result with regex-based operations.
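A minimal sketch of such a regex-based fix is shown below. The rule itself (a `!` directly after a letter or digit is a misread `1`) and the `fix_misread_digits` name are assumptions chosen to fit this sample output, not a general-purpose correction.

```python
import re


def fix_misread_digits(text):
    """Replace a '!' that directly follows a letter or digit with '1'.

    Hypothetical rule: it suits cell labels like 'header!', but it would
    also rewrite genuine exclamation marks such as 'Done!'.
    """
    return re.sub(r"(?<=[A-Za-z0-9])!", "1", text)


ocr_line = "header! header2 header3"
fixed_line = fix_misread_digits(ocr_line)
```

In practice the substitution rules depend on what the table is expected to contain, so it helps to restrict them to columns or rows where digits are known to appear.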

After post-processing, the OCR result can be parsed to read the row and column values.
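One simple way to do the parsing, assuming each OCR line is one table row and cells contain no internal whitespace, is to split on whitespace. The `parse_table` and `to_records` helpers are hypothetical names for this sketch, and the input is the sample output above after the `!` correction.

```python
def parse_table(ocr_text):
    """Split OCR text into rows of cells, skipping blank lines.

    Assumes one table row per line and whitespace-free cell values.
    """
    rows = []
    for line in ocr_text.splitlines():
        cells = line.split()
        if cells:
            rows.append(cells)
    return rows


def to_records(rows):
    """Pair each body row with the header row (the first row)."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


ocr_text = """header1 header2 header3

row1cell1 row1cell2 row1cell3

row2cell1 row2cell2 row2cell3
"""

rows = parse_table(ocr_text)
records = to_records(rows)
```

For cells that contain spaces this approach breaks down, and the column boundaries would have to come from the detected grid coordinates instead.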

In this case, font information can also be used to classify the sheet title, headings, and normal cell values.
