Tesseract OCR text order for documents with tables or rows

Views: 48 | Posted: 2021/6/12 18:35:32 | Tags: ocr, tesseract


Problem description

I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it is highly effective, but I am having issues with the order in which the text is scanned. Documents with tabular data seem to be scanned down column by column, when the more natural way would be to scan row by row. A very small-scale example would be:

This is column A, row 1   This is column B, row 1    This is column C, row 1
This is column A, row 2   This is column B, row 2    This is column C, row 2

is producing the following text:

This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2

I am starting to read the documentation and take a guess-and-test, brute-force approach with the parameters documented here, but if someone has already tackled a similar issue, I would appreciate any insight on the fix. It could also be a matter of training data, but I do not know exactly how that works.
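To make the desired row-by-row order concrete, here is a minimal sketch of one possible post-processing step. It is not a Tesseract feature and not a confirmed fix. It assumes each recognized word is available with its position, as in the `left`, `top`, `height` and `text` columns of Tesseract's TSV output. The function name `group_words_into_rows`, the dict layout and the `tolerance` value are hypothetical choices made for illustration.

```python
from typing import Dict, List


def group_words_into_rows(words: List[Dict], tolerance: int = 10) -> List[str]:
    """Rebuild row-by-row text from word boxes.

    Each word is a dict with 'left', 'top', 'height' and 'text' keys
    (a hypothetical input shape, modelled on Tesseract's TSV columns).
    Words whose vertical centers lie within `tolerance` pixels of the
    first word of a row are treated as belonging to that row.
    """
    rows: List[Dict] = []  # each entry: {'center': float, 'words': [...]}

    # Visit words from top to bottom so rows are created in reading order.
    for word in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        center = word["top"] + word["height"] / 2
        if rows and abs(center - rows[-1]["center"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})

    # Within each row, order the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(row["words"], key=lambda w: w["left"]))
        for row in rows
    ]


if __name__ == "__main__":
    # Made-up boxes that arrive column by column, as in the question.
    sample = [
        {"left": 10, "top": 10, "height": 12, "text": "A1"},
        {"left": 10, "top": 40, "height": 12, "text": "A2"},
        {"left": 200, "top": 11, "height": 12, "text": "B1"},
        {"left": 200, "top": 41, "height": 12, "text": "B2"},
    ]
    print("\n".join(group_words_into_rows(sample)))
```

The tolerance has to be tuned to the scan resolution and line spacing, and skewed pages would need deskewing first. Tesseract's page segmentation mode option (`--psm`) is the documented parameter that controls layout analysis, so it may be worth testing different modes before resorting to post-processing like this.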
