How do I segment a document using Tesseract then output the resulting bounding boxes and labels


Question

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre-OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions, where contestants had to segment various documents (academic paper here). Here's an example from that paper illustrating what I want to create:

I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.

tesseract infile.tiff outfile_stem -l eng -psm 1 hocr

gives a bounding box for everything and has some labelling in class tags e.g.

<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
    <span class='ocr_line' id='line_5_142' ...

but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
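As a point of reference for the Python request further down, here is a minimal sketch of how the `bbox` values and class labels could be pulled out of hOCR output using only the standard library. The names `HocrBoxParser` and `parse_hocr_boxes` are hypothetical, and the `ocr_line` coordinates in the sample are made up because the original snippet elides them. This only lists the boxes; it does not draw them.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (class, id, bbox) for every element whose title holds a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        match = BBOX_RE.search(attr.get("title") or "")
        if match:
            bbox = tuple(int(v) for v in match.groups())
            self.boxes.append((attr.get("class"), attr.get("id"), bbox))


def parse_hocr_boxes(hocr_text):
    """Return a list of (class, id, (x0, y0, x1, y1)) tuples in document order."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


# Sample modelled on the snippet above; the ocr_line bbox is invented.
hocr_sample = """
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
  <span class='ocr_line' id='line_5_142' title="bbox 2194 4490 3842 4589">text</span>
</p>
"""

for cls, ident, bbox in parse_hocr_boxes(hocr_sample):
    print(cls, ident, bbox)
```

To use it on a real file, read the hOCR output into a string and pass it to `parse_hocr_boxes`. Note that the classes are structural (`ocr_par`, `ocr_line`, and so on) rather than region types such as "heading" or "image".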

The current head version details:

tesseract 3.04.00
 leptonica-1.71
  libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5


Edit

I'm really looking to achieve this using the command line tool (as in the examples above). @nguyenq has pointed me to the API reference; unfortunately, I have no C++ experience. If the only solution is to use the API, could you please provide a quick Python example?

Recommended answer

Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.

Below I give the full solution for a Mac running OS X 10.10 and using the Homebrew package manager. I use Wine to run Windows executables.

  1. Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
  2. Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
  3. Use the PVT to view the original image with the PAGE xml information overlaid

Code

brew install wine  # takes a little while >10m
brew install gs    # only for generating a tif example. Not required, you can use Preview
brew install wget  # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be omitted and you can do the conversion to tiff with Preview
gs                          \
  -o paper-%d.tif           \
  -sDEVICE=tiff24nc         \
  -r300x300                 \
   paper.pdf 

cd ~/Downloads
dl=~  # home directory; must be set before the paths below are used
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
# Console output is appended to log.txt in the current directory (~/Downloads)
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"   \
  -inp-img "$dl/Downloads/paper-3.tif"           \
  -out-xml "$dl/Downloads/paper-3-tool.xml"      \
  -rec-mode layout >> log.txt

# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"

Results

[Image: document with overlays (roll over a region to see its text and type)]
[Image: overlays alone (use the GUI buttons to toggle)]
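If you would rather get at the labelled regions programmatically than through the viewer, the PAGE XML written by the tool can be read with Python's standard library. The following is a minimal sketch, not part of the PRImA tooling: `read_page_regions` is a hypothetical helper, the sample document is made up, and it assumes region outlines are stored either in a `points` attribute on `Coords` or as `Point` child elements, depending on the schema version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_page_regions(xml_text):
    """Return (region element name, type attribute, outline points) per region."""
    root = ET.fromstring(xml_text)
    regions = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if not name.endswith("Region"):
            continue
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        if coords is None:
            continue
        points_attr = coords.get("points")
        if points_attr:
            # Newer layout: points="x1,y1 x2,y2 ..."
            points = [tuple(int(v) for v in pair.split(","))
                      for pair in points_attr.split()]
        else:
            # Older layout: <Point x="..." y="..."/> children
            points = [(int(p.get("x")), int(p.get("y")))
                      for p in coords if local_name(p.tag) == "Point"]
        regions.append((name, elem.get("type"), points))
    return regions


# Made-up sample for illustration only, not real tool output.
page_sample = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="paper-3.tif" imageWidth="2480" imageHeight="3508">
    <TextRegion id="r0" type="heading">
      <Coords points="10,10 200,10 200,40 10,40"/>
    </TextRegion>
    <ImageRegion id="r1">
      <Coords points="10,60 200,60 200,300 10,300"/>
    </ImageRegion>
  </Page>
</PcGts>
"""

for name, region_type, points in read_page_regions(page_sample):
    print(name, region_type, points)
```

For a file on disk, read its contents as bytes and pass them to `read_page_regions`. Matching on local names keeps the sketch independent of which schema version namespace the file declares.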

You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!

# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel  # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
# The page viewer will only open XML files
mv ~/Downloads/paper-3.hocr ~/Downloads/paper-3.xml
# Run from the "$pvtool" directory, with dl=~ as set earlier
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
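If all you need is to see the hOCR boxes and the viewer is not an option, one alternative (not something the PRImA tools do) is to draw the boxes yourself as an SVG overlay and open it in a browser. This is a rough sketch using only the Python standard library: `boxes_to_svg` is a hypothetical helper, and the page size and file name in the example are assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def boxes_to_svg(boxes, width, height, image_href=None):
    """Build an SVG string with one red rectangle per (label, bbox) pair.

    boxes: iterable of (label, (x0, y0, x1, y1)) in image pixel coordinates.
    image_href: optional path or URL of the page image to place underneath.
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
    ]
    if image_href:
        parts.append(
            f'<image xlink:href={quoteattr(image_href)} x="0" y="0" '
            f'width="{width}" height="{height}"/>'
        )
    for label, (x0, y0, x1, y1) in boxes:
        # The <title> child shows the label as a tooltip on hover.
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="none" stroke="red" stroke-width="4">'
            f'<title>{escape(label)}</title></rect>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Example with the paragraph box from the question; the page size is assumed.
svg_text = boxes_to_svg(
    [("ocr_par par_5_82", (2194, 4490, 3842, 4589))],
    width=5000,
    height=6000,
    image_href="paper-3.png",  # assumed PNG copy of the page image
)
print(svg_text)
```

Many browsers do not render TIFF, so the page image would likely need converting to PNG first. The boxes carry only whatever label you pass in; hOCR classes do not include the region types that the PAGE output does.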

At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:

pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST

Unfortunately, I kept getting null pointers.

Could not convert to target XML schema format.
java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
