How to keep Tesseract from inserting extra whitespace in words?


Problem description


I asked about this on the Tesseract forum already.

Using Tesseract (and ImageMagick), I'm trying to extract the text of this PDF file.

This is the section of the PDF that I'm working on; it's line #7 of the PDF.

In this section, Tesseract is running into problems when trying to identify the string CONSTRUCTORA.

It sees CO NSTRUCTO RA

It should see CONSTRUCTORA

Can anyone suggest any possible fixes for this?

This is the commandline sequence:

convert -density 600 my_pdf.pdf tmp.tif 
tesseract -l spa tmp.tif stdout > tmp.txt 

These are the software versions:

~% tesseract --version 
tesseract 3.05.01 
leptonica-1.74.4 
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
libtiff 4.0.3 : zlib 1.2.8 
~% convert --version 
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org 
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC 
Features: OpenMP 

Solution

To deal with the irregular kerning of the PDF file, Will suggested tweaking the parameters around tosp_min_sane_kn_sp, which are listed in the docs: https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_parameters.md

Setting tosp_min_sane_kn_sp=2.8 solved the issue that was described in the question.

The new Tesseract invocation is the following:

tesseract -c tosp_min_sane_kn_sp=2.8 -l spa tmp.tif stdout > tmp.txt

The default value for tosp_min_sane_kn_sp seems to be 1.5. So far, I have only tested with values larger than 1.5.
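
Since only a few values have been tried so far, a small sweep can help compare them. The following is a rough sketch, not a tested recipe: the function name sweep and the TESSERACT, IMAGE and EXPECTED variables are made up for illustration, and it assumes tmp.tif was already produced by the convert step above. For each candidate value it prints the value and the number of OCR output lines that contain the expected word intact.

```shell
#!/bin/sh
# Hypothetical helper: try several tosp_min_sane_kn_sp values and report,
# for each one, how many OCR output lines contain the expected word unbroken.
# Assumptions: tmp.tif exists and the Spanish language data (spa) is installed.

TESSERACT="${TESSERACT:-tesseract}"    # overridable, e.g. for a dry run
IMAGE="${IMAGE:-tmp.tif}"              # image produced by the convert step
EXPECTED="${EXPECTED:-CONSTRUCTORA}"   # word that should come out unbroken

sweep() {
    for value in "$@"; do
        # grep -c prints the count of lines containing the word (0 if none)
        hits=$("$TESSERACT" -c tosp_min_sane_kn_sp="$value" -l spa "$IMAGE" stdout 2>/dev/null \
            | grep -cF -- "$EXPECTED")
        echo "$value $hits"
    done
}
```

It could be called as sweep 1.5 2.0 2.8 3.5. Apart from 1.5 and 2.8, these values are arbitrary examples; a value that reports a count above zero keeps the word together on this page, but the rest of the output should still be checked for words that are now wrongly merged.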
