Detect table with OpenCV


Problem description

I often work with scanned papers. The papers contain tables (similar to Excel tables) that I need to type into the computer manually. To make matters worse, the tables can have different numbers of columns. Entering them into Excel by hand is tedious, to say the least.

I thought I could save myself a week of work if I could write a program to OCR them. Would it be possible to detect the header text areas with OpenCV and then OCR the text within the detected image coordinates?

Can I achieve this with the help of OpenCV, or do I need an entirely different approach?

Edit: The example table is really just a standard table, similar to what you can see in Excel and other spreadsheet applications.

Solution

This question seems a little old, but I was working on a similar problem and came up with my own solution, which I explain here.

When reading text with any OCR engine, several challenges stand in the way of good accuracy. The main ones are:

  1. Presence of noise: poor image quality or unwanted elements/blobs in the background region. This requires some pre-processing such as noise removal, which can be done easily with a Gaussian filter or a plain median filter. Both are available in OpenCV.

  2. Wrong orientation of the image: when the image is rotated or skewed, the OCR engine fails to segment the lines and words correctly, which gives the worst accuracy.

  3. Presence of lines: during word or line segmentation, the OCR engine sometimes merges ruling lines with the words, so it processes the wrong content and returns wrong results. There are other issues as well, but these are the basic ones.

In this case, I think the scanned image is clean and simple enough that the following steps can be used to solve the problem.

  1. Simple image binarization removes the background, leaving only the necessary content.

  2. Next we have to remove the lines, which in this case form the table grid. They can be identified with connected-component analysis: the grid shows up as very large connected components, which are then removed. The resulting image, without the grid, is what we feed to the OCR engine.

  3. For OCR we can use the open-source Tesseract OCR engine. I got the following result from it:

    Caption title

    header! header2 header3

    row1cell1 row1cell2 row1cell3

    row2cell1 row2cell2 row2cell3

  4. As we can see, the result is quite accurate, but there are a few issues, such as header!, which should be header1. This happens because the OCR engine confused 1 with !. The problem can be solved by post-processing the result with regex-based operations.

After post-processing, the OCR result can be parsed to read the row and column values.
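A minimal sketch of the post-processing and parsing might look like this. The correction rule and the table layout are assumptions made for illustration: the regex treats a `!` glued to the end of an alphanumeric token as a misread `1`, the first non-empty line is taken as the caption and the second as the header row. The function names are hypothetical.

```python
import re


def fix_digit_confusions(text: str) -> str:
    """Replace a trailing '!' on an alphanumeric token with '1'.

    Assumed rule: table cells here never end in a real exclamation mark.
    """
    return re.sub(r"(?<=[A-Za-z0-9])!(?=\s|$)", "1", text)


def parse_table(text: str) -> dict:
    """Split OCR text into caption, header cells and data rows.

    Cells are split on whitespace, so multi-word cells are not supported.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a caption and a header line")
    return {
        "caption": lines[0],
        "header": lines[1].split(),
        "rows": [line.split() for line in lines[2:]],
    }


ocr_text = """Caption title

header! header2 header3

row1cell1 row1cell2 row1cell3

row2cell1 row2cell2 row2cell3
"""
table = parse_table(fix_digit_confusions(ocr_text))
```

Splitting on whitespace is only adequate for single-word cells like the ones in this example. For real tables, the cell boundaries recovered from the grid before it was removed would be a more reliable way to assign text to columns.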

In this case, font information can also be used to classify the sheet title, the headings, and the normal cell values.
