Tesseract OCR text order for documents with tables or rows

Question

I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it is highly effective, but I am having issues with the order in which the text is scanned. Documents with tabular data seem to be scanned down column by column, when the more natural way would be to scan row by row. A very small-scale example would be:

This is column A, row 1   This is column B, row 1    This is column C, row 1
This is column A, row 2   This is column B, row 2    This is column C, row 2

This produces the following text:

This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2

I am starting to read the documentation and take a guess-and-test, brute-force approach with the parameters documented here, but if someone has already tackled a similar issue, I would appreciate any insight on the fix. It could also be a matter of training data, but I do not know exactly how that works.

Recommended answer

Try running tesseract in one of the single-column Page Segmentation Modes:

tesseract input.tif output-filename --psm 6

By default, Tesseract expects a page of text when it segments an image. If you only want to OCR a small region, try a different segmentation mode using the -psm argument (spelled --psm in Tesseract 4 and later, as in the command above). Note that adding a white border to text that is cropped too tightly may also help; see issue 398.
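For scripted runs, the same command can be driven from Python. The following is only a minimal sketch, assuming the tesseract executable is on PATH; the helper names build_tesseract_cmd and ocr_to_text and the legacy_flag parameter are hypothetical and not part of Tesseract itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, psm=6, legacy_flag=False):
    """Return the argument list for one tesseract run (hypothetical helper).

    legacy_flag=True emits the single-dash -psm spelling used by 3.x releases;
    the default emits --psm, as in the command shown in the answer.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", str(image), str(out_base), flag, str(psm)]


def ocr_to_text(image, out_base, psm=6, legacy_flag=False):
    """Run tesseract once and return the recognized text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_cmd(image, out_base, psm, legacy_flag)
    subprocess.run(cmd, check=True)
    # tesseract appends .txt to the output base name it is given
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

Calling ocr_to_text("input.tif", "output-filename", psm=6) would correspond to the command line above and return the contents of output-filename.txt.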

To see the complete list of supported page segmentation modes, run tesseract -h. Here is an excerpt of the list as of 3.21:

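  3. Fully automatic page segmentation, but no OSD. (Default)
  4. Assume a single column of text of variable sizes.
  5. Assume a single uniform block of vertically aligned text.
  6. Assume a single uniform block of text.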
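Since the best mode depends on the document, one way to compare them is to run the same image through each mode in the excerpt (numbered 3 through 6 by tesseract -h) and inspect the resulting text files side by side. This is a rough sketch rather than a recommended workflow: the output names output-psm<N> and the helpers commands_for_modes and run_all are made up for illustration, and the --psm spelling assumes Tesseract 4 or later.

```python
import shutil
import subprocess

# Mode numbers and descriptions as listed by `tesseract -h`.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD. (Default)",
    4: "Assume a single column of text of variable sizes.",
    5: "Assume a single uniform block of vertically aligned text.",
    6: "Assume a single uniform block of text.",
}


def commands_for_modes(image, modes=PSM_MODES):
    """Build one tesseract command per mode, each with its own output base.

    The output-psm<N> naming is an assumption made for this sketch.
    """
    return {
        psm: ["tesseract", image, f"output-psm{psm}", "--psm", str(psm)]
        for psm in sorted(modes)
    }


def run_all(image):
    """Run every command; results land in output-psm<N>.txt for comparison."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    for psm, cmd in commands_for_modes(image).items():
        print(f"psm {psm}: {PSM_MODES[psm]}")
        subprocess.run(cmd, check=True)
```

After run_all("input.tif"), comparing output-psm3.txt with output-psm6.txt should show whether the table rows come out column by column or row by row.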

See examples here: #using-different-page-segmentation-modes
