How do I segment a document using Tesseract then output the resulting bounding boxes and labels

Published: 2020/5/19 19:23:49 · Tags: ocr, tesseract, hocr

Problem description

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre-OCR). I know it must be capable of doing this out of the box because of the results shown at the ICDAR competitions, where contestants had to segment various documents (academic paper here). That paper includes an example illustrating what I want to create.

I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.

tesseract infile.tiff outfile_stem -l eng -psm 1 hocr

gives a bounding box for everything and has some labelling in class tags e.g.

<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
    <span class='ocr_line' id='line_5_142' ...

but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
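
One way to inspect the hOCR output without a dedicated viewer is to pull each element's class and bounding box out of the file with a few lines of Python. The following is a minimal sketch using only the standard library. The names `HocrBoxParser` and `hocr_boxes` are illustrative and not part of Tesseract, and the `ocr_line` coordinates in the sample are made up to complete the truncated snippet above.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collects (class, id, (x0, y0, x1, y1)) for every element carrying a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            bbox = tuple(int(v) for v in match.groups())
            self.boxes.append((attrs.get("class"), attrs.get("id"), bbox))


def hocr_boxes(hocr_text):
    """Return all labelled boxes found in an hOCR document string."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


# Sample input: the first element is from the question, the second is invented.
sample = """<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' title="bbox 2194 4490 3842 4530">text</span></p>"""

for label, element_id, bbox in hocr_boxes(sample):
    print(label, element_id, bbox)
```

For a real file, the same function could be called on the contents of the hOCR output, for example `hocr_boxes(open("outfile_stem.hocr").read())`, assuming that is the name Tesseract wrote. Note that the labels recovered this way are the hOCR classes such as `ocr_par` and `ocr_line`.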

The current head version details:

tesseract 3.04.00
 leptonica-1.71
  libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5

Edit

I'm really looking to achieve this using the command line tool (as in the examples above). @nguyenq has pointed me to the API reference, but unfortunately I have no C++ experience. If the only solution is to use the API, could you please provide a quick Python example?

Recommended answer

Code

brew install wine  # takes a little while >10m
brew install gs    # only for generating a tif example. Not required, you can use Preview
brew install wget  # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be omitted and you can do the conversion to tiff with Preview
gs                          \
  -o paper-%d.tif           \
  -sDEVICE=tiff24nc         \
  -r300x300                 \
   paper.pdf 

cd ~/Downloads
# dl is your home directory; it must be set before the paths below are expanded
dl=~
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"   \
  -inp-img "$dl/Downloads/paper-3.tif"           \
  -out-xml "$dl/Downloads/paper-3-tool.xml"      \
  -rec-mode layout >> log.txt

# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"

Results

Document with overlays (roll over to see text and type)
Overlays alone (use GUI buttons to toggle)

You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!

# Note that the pvtool does take hOCR xml as input, but it ignores the region type
brew install tesseract --devel  # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv ~/Downloads/paper-3.hocr ~/Downloads/paper-3.xml  # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"

At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:

pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST

Unfortunately, I kept getting null pointers.

Could not convert to target XML schema format.
java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
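
Since the conversion step fails here, one fallback is to draw the hOCR boxes directly rather than going through PAGE XML. The following is a minimal sketch, assuming the boxes have already been read from the hOCR file as (label, (x0, y0, x1, y1)) pairs in pixel coordinates. The function name `boxes_to_svg`, the colour table and the page size of 5100 x 6600 pixels are illustrative assumptions, not anything produced by Tesseract or the PRImA tools.

```python
from xml.sax.saxutils import escape, quoteattr

# Illustrative colours per hOCR class; anything else falls back to grey.
COLOURS = {"ocr_carea": "red", "ocr_par": "blue", "ocr_line": "green"}


def boxes_to_svg(boxes, width, height):
    """Return an SVG document with one labelled rectangle per (label, bbox) pair."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
    ]
    for label, (x0, y0, x1, y1) in boxes:
        colour = quoteattr(COLOURS.get(label, "grey"))
        # Outline of the region.
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="none" stroke={colour} stroke-width="4"/>'
        )
        # Label drawn just above the top-left corner of the region.
        parts.append(
            f'<text x="{x0}" y="{y0 - 6}" fill={colour} font-size="40">'
            f"{escape(label)}</text>"
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical page size; the single box is the ocr_par from the question.
svg = boxes_to_svg([("ocr_par", (2194, 4490, 3842, 4589))], width=5100, height=6600)
print(svg)
```

Writing the returned string to a file such as `paper-3-boxes.svg` (a hypothetical name) gives something a browser can open. This only shows the hOCR classes, so it does not recover the region types that the PAGE output carries.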
