Define multiple columns in tesseract OCR parameters?

Problem description

I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.

Following suggestions on other questions, I've tried playing with --psm and hocr without great success.

Working with a jpg I've posted on github, and converting it into a text-embedded pdf using this command:

    tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf

I get this result:

Clearly the engine is making one block containing the indented lines, and another containing the flush lines.

Confirming this is the text output of the flush lines:


Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.

to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.

application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——

Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)
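
For reference, the image-cutting fallback mentioned above can be sketched roughly as follows. This is only an illustration: it assumes six equal-width columns, and the helper name column_boxes is hypothetical. Real newspaper scans usually need margins and gutters measured per page.

```python
def column_boxes(width, height, n_columns=6, overlap=0):
    """Split a page of width x height pixels into equal-width column boxes.

    Returns a list of (left, upper, right, lower) tuples. `overlap` widens
    each box by that many pixels on both sides, clamped to the page edges.
    Assumption: columns are equally wide, which may not hold for a real scan.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    boxes = []
    for i in range(n_columns):
        left = max(0, i * width // n_columns - overlap)
        right = min(width, (i + 1) * width // n_columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes
```

Each tuple is in the (left, upper, right, lower) order that Pillow's Image.crop accepts, so every crop could then be passed to tesseract on its own.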

Solution

You can use --psm 4 --oem 1, or --psm 4 --oem 3, to get better text and accuracy.
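
As a sketch of how these flags fit into the command from the question: --psm 4 tells Tesseract to assume a single column of text of variable sizes, --oem 1 selects the LSTM engine only, and --oem 3 is the default engine choice. The helper names build_tesseract_command and run_ocr are hypothetical, and the file names are the ones from the question. Whether this resolves the six-column page is not confirmed here.

```python
import subprocess


def build_tesseract_command(image, out_base, langs="eng+fra",
                            psm=4, oem=1, fmt="pdf"):
    """Build the argument list for a tesseract CLI call (hypothetical helper).

    Mirrors the command in the question, with --psm and --oem made explicit.
    """
    return [
        "tesseract", image, out_base,
        "-l", langs,
        "--psm", str(psm),
        "--oem", str(oem),
        fmt,
    ]


def run_ocr(image, out_base, **options):
    """Run tesseract with the built command; requires tesseract on PATH."""
    subprocess.run(build_tesseract_command(image, out_base, **options),
                   check=True)


# Example call (not executed here):
# run_ocr("1906-07-02-p4.jpg", "out", psm=4, oem=1)
```

Passing oem=3 instead gives the second variant suggested in the answer.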
