How to preserve document structure in tesseract
Problem description
I am using tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.
The output I get is as follows:
Someto the left
Some in the middle
Some with some tab
Some with some space between them
Sometext here
this much
How can I get the desired output, with the same structure as the image? That is, as follows:
Some text here
Some to the left
Some in the middle
Some with some tab
Some with some space between them this much
Recommended answer
Newer versions of tesseract (3.04) have an option called preserve_interword_spaces, which should do what you want.
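As a rough illustration of how the option might be switched on, here is a minimal Python sketch that calls the tesseract command-line tool with `-c preserve_interword_spaces=1`. It assumes tesseract is installed and on the PATH, and that the build accepts `stdout` as the output base. The function names `build_tesseract_command` and `ocr_image` are made up for this example and are not part of any library.

```python
import subprocess


def build_tesseract_command(image_path, preserve_spaces=True):
    """Build the argument list for a tesseract run that prints text to stdout."""
    # Using "stdout" as the output base prints the text instead of writing a file.
    command = ["tesseract", image_path, "stdout"]
    if preserve_spaces:
        # -c sets one config variable for this run only.
        command += ["-c", "preserve_interword_spaces=1"]
    return command


def ocr_image(image_path, preserve_spaces=True):
    """Run tesseract (assumed to be on PATH) and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, preserve_spaces),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_image("input.png")`, where `input.png` is a placeholder file name, would return the text with whatever spacing tesseract detected. Passing `preserve_spaces=False` gives the default behavior for comparison.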
Note that the number of spaces tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output that way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that extraction found. By default, tesseract collapses runs of spaces into one.
Details on this option are here.