Tesseract的hOCR输出是否真的包含每个字符的边界框和置信度? [英] Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

查看:117
本文介绍了Tesseract的hOCR输出是否真的包含每个字符的边界框和置信度?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Tesseract常见问题解答,他们说您可以:

如何获取每个字符的坐标和置信度?

How can I get the coordinates and confidence of each character?

有 有两个选择.如果您不想参加编程,则可以 使用Tesseract的hocr输出格式(请阅读Tesseract手册页以获取 详细信息.)

There are two options. If you would rather not get into programming, you can use Tesseract's hocr output format (read the Tesseract manual page for details).

但是,当我创建样本hOCR输出(它是一个.html文件)时,边界框和置信度仅在单词级别可用.

But when I created a sample hOCR output (it's an .html file), the bounding boxes and confidence levels were only available at the word level.

我在这里想念东西吗?

我已将示例输入/输出添加为插图(调整了输入的大小).

I've added the sample input/output as illustration (the input is resized).

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "in2.tif"; bbox 0 0 1882 354'>
<div class='ocr_carea' id='block_1_1' title="bbox 78 59 457 100">
<p class='ocr_par'>
<span class='ocr_line' id='line_1_1' title="bbox 78 61 456 97"><span class='ocr_word' id='word_1_1' title="bbox 78 62 175 97"><span class='ocrx_word' id='xword_1_1' title="x_wconf -2">Dear</span></span> <span class='ocr_word' id='word_1_2' title="bbox 205 62 271 96"><span class='ocrx_word' id='xword_1_2' title="x_wconf -14">Mr:</span></span> <span class='ocr_word' id='word_1_3' title="bbox 303 61 456 97"><span class='ocrx_word' id='xword_1_3' title="x_wconf -2">Grover:</span></span></span>
</p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 75 154 1842 317">
<p class='ocr_par'>
<span class='ocr_line' id='line_1_2' title="bbox 78 161 1787 210"><span class='ocr_word' id='word_1_4' title="bbox 78 161 111 196"><span class='ocrx_word' id='xword_1_4' title="x_wconf -2">If</span></span> <span class='ocr_word' id='word_1_5' title="bbox 137 161 270 205"><span class='ocrx_word' id='xword_1_5' title="x_wconf -2">you&#39;ve</span></span> <span class='ocr_word' id='word_1_6' title="bbox 298 162 393 197"><span class='ocrx_word' id='xword_1_6' title="x_wconf -1">been</span></span> <span class='ocr_word' id='word_1_7' title="bbox 422 161 571 206"><span class='ocrx_word' id='xword_1_7' title="x_wconf -3">looking</span></span> <span class='ocr_word' id='word_1_8' title="bbox 598 162 657 197"><span class='ocrx_word' id='xword_1_8' title="x_wconf -2">for</span></span> <span class='ocr_word' id='word_1_9' title="bbox 685 174 707 198"><span class='ocrx_word' id='xword_1_9' title="x_wconf -1">a</span></span> <span class='ocr_word' id='word_1_10' title="bbox 734 162 929 207"><span class='ocrx_word' id='xword_1_10' title="x_wconf -4">reporting</span></span> <span class='ocr_word' id='word_1_11' title="bbox 956 163 1031 198"><span class='ocrx_word' id='xword_1_11' title="x_wconf -1">tool</span></span> <span class='ocr_word' id='word_1_12' title="bbox 1059 162 1140 199"><span class='ocrx_word' id='xword_1_12' title="x_wconf -3">that</span></span> <span class='ocr_word' id='word_1_13' title="bbox 1168 164 1294 199"><span class='ocrx_word' id='xword_1_13' title="x_wconf -4">allows</span></span> <span class='ocr_word' id='word_1_14' title="bbox 1321 175 1428 200"><span class='ocrx_word' id='xword_1_14' title="x_wconf -1">users</span></span> <span class='ocr_word' id='word_1_15' title="bbox 1456 169 1494 200"><span class='ocrx_word' id='xword_1_15' title="x_wconf -3">to</span></span> <span class='ocr_word' id='word_1_16' title="bbox 1523 169 1649 200"><span class='ocrx_word' id='xword_1_16' title="x_wconf -2">create</span></span> <span class='ocr_word' id='word_1_17' title="bbox 1677 170 1787 210"><span class='ocrx_word' id='xword_1_17' title="x_wconf -3">great</span></span></span>
<span class='ocr_line' id='line_1_3' title="bbox 77 210 1841 260"><span class='ocr_word' id='word_1_18' title="bbox 77 210 226 256"><span class='ocrx_word' id='xword_1_18' title="x_wconf -3">looking</span></span> <span class='ocr_word' id='word_1_19' title="bbox 253 216 399 256"><span class='ocrx_word' id='xword_1_19' title="x_wconf -4">reports</span></span> <span class='ocr_word' id='word_1_20' title="bbox 427 211 581 256"><span class='ocrx_word' id='xword_1_20' title="x_wconf -3">quickly,</span></span> <span class='ocr_word' id='word_1_21' title="bbox 613 224 654 248"><span class='ocrx_word' id='xword_1_21' title="x_wconf -2">as</span></span> <span class='ocr_word' id='word_1_22' title="bbox 682 213 763 248"><span class='ocrx_word' id='xword_1_22' title="x_wconf -1">well</span></span> <span class='ocr_word' id='word_1_23' title="bbox 792 224 832 248"><span class='ocrx_word' id='xword_1_23' title="x_wconf -1">as</span></span> <span class='ocr_word' id='word_1_24' title="bbox 859 212 1056 258"><span class='ocrx_word' id='xword_1_24' title="x_wconf -4">providing</span></span> <span class='ocr_word' id='word_1_25' title="bbox 1083 212 1144 249"><span class='ocrx_word' id='xword_1_25' title="x_wconf -2">the</span></span> <span class='ocr_word' id='word_1_26' title="bbox 1173 214 1315 249"><span class='ocrx_word' id='xword_1_26' title="x_wconf -2">control</span></span> <span class='ocr_word' id='word_1_27' title="bbox 1344 215 1417 249"><span class='ocrx_word' id='xword_1_27' title="x_wconf -2">and</span></span> <span class='ocr_word' id='word_1_28' title="bbox 1445 214 1639 250"><span class='ocrx_word' id='xword_1_28' title="x_wconf -2">industrial</span></span> <span class='ocr_word' id='word_1_29' title="bbox 1667 215 1841 260"><span class='ocrx_word' id='xword_1_29' title="x_wconf -3">strength</span></span></span>
<span class='ocr_line' id='line_1_4' title="bbox 76 260 1370 306"><span class='ocr_word' id='word_1_30' title="bbox 76 261 243 296"><span class='ocrx_word' id='xword_1_30' title="x_wconf -2">features</span></span> <span class='ocr_word' id='word_1_31' title="bbox 272 260 353 297"><span class='ocrx_word' id='xword_1_31' title="x_wconf -2">that</span></span> <span class='ocr_word' id='word_1_32' title="bbox 381 273 427 297"><span class='ocrx_word' id='xword_1_32' title="x_wconf -1">an</span></span> <span class='ocr_word' id='word_1_33' title="bbox 458 261 499 297"><span class='ocrx_word' id='xword_1_33' title="x_wconf -2">IS</span></span> <span class='ocr_word' id='word_1_34' title="bbox 527 262 776 306"><span class='ocrx_word' id='xword_1_34' title="x_wconf -2">professional</span></span> <span class='ocr_word' id='word_1_35' title="bbox 804 263 1110 299"><span class='ocrx_word' id='xword_1_35' title="x_wconf -2">demands...look</span></span> <span class='ocr_word' id='word_1_36' title="bbox 1139 275 1184 299"><span class='ocrx_word' id='xword_1_36' title="x_wconf -1">no</span></span> <span class='ocr_word' id='word_1_37' title="bbox 1212 263 1370 299"><span class='ocrx_word' id='xword_1_37' title="x_wconf -3">further!</span></span></span>
</p>
</div>
</div>
</body>
</html>

推荐答案

您已经看到了:它不存在.

You've seen it: it isn't there.

因此,您可以修改Tesseract源代码以输出支持所需x_confs属性的hOCR格式,也可以使用其ResultIterator API类来获得字符(符号)级别的置信度(请确保在SetVariable("save_blob_choices", "T") >方法).

So you can either modify Tesseract source code to output hOCR format that supports x_confs property that you want or use its ResultIterator API class to get confidence at the character (symbol) level (be sure to SetVariable("save_blob_choices", "T") after Init method).

这篇关于Tesseract的hOCR输出是否真的包含每个字符的边界框和置信度?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆