How to keep Tesseract from inserting extra whitespace in words?
I have already asked about this on the Tesseract forum.
I'm trying to extract the text of this PDF file using Tesseract (and ImageMagick).
This is the section of the PDF that I'm working on; it is line #7 of the PDF:
In this section, Tesseract is running into problems when trying to identify the string CONSTRUCTORA.
It sees CO NSTRUCTO RA
It should see CONSTRUCTORA
Can anyone suggest any possible fixes for this?
This is the command-line sequence:
convert -density 600 my_pdf.pdf tmp.tif
tesseract -l spa tmp.tif stdout > tmp.txt
These are the software versions:
~% tesseract --version
tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8
~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
To deal with the irregular kerning in the PDF file, Will suggested tweaking the parameters around tosp_min_sane_kn_sp in the docs: https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_parameters.md
Setting tosp_min_sane_kn_sp=2.8 solved the issue described in the question.
The new Tesseract invocation is the following:
tesseract -c tosp_min_sane_kn_sp=2.8 -l spa tmp.tif stdout > tmp.txt
The default value for tosp_min_sane_kn_sp seems to be 1.5. So far, I have only tested with values larger than 1.5.
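Since 2.8 was found by trial, a small sweep can make it easier to see which values keep the word intact. The following is only a rough sketch: the function names ocr_with_kn_sp and count_hits and the list of candidate values are made up for illustration, while the tesseract invocation itself is the one shown above. It counts the output lines that contain the expected word unbroken.

```shell
#!/usr/bin/env bash
# Sketch only: sweep tosp_min_sane_kn_sp and count how often the expected
# word survives intact. Function names and candidate values are hypothetical.

# Run Tesseract once with the given threshold and print the OCR text.
ocr_with_kn_sp() {
    local value="$1" image="$2"
    tesseract -c "tosp_min_sane_kn_sp=${value}" -l spa "$image" stdout
}

# Print the number of OCR output lines that contain the word unbroken.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_hits() {
    local value="$1" word="$2" image="$3"
    ocr_with_kn_sp "$value" "$image" | grep -c -F -- "$word" || true
}

# Only run the sweep when the input image and the tesseract binary exist.
if [ -f tmp.tif ] && command -v tesseract >/dev/null 2>&1; then
    for value in 1.5 2.0 2.8 3.5; do
        printf '%s\t%s\n' "$value" "$(count_hits "$value" CONSTRUCTORA tmp.tif)"
    done
fi
```

A count of 0 for a given value means the word was still split (or not recognized at all). It is probably worth checking neighbouring words as well, since a threshold that is too high might plausibly merge words that should stay separate; I have not tested that.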