Extract data from tesseract hocr xhtml file

Problem description

I'm trying to use Python to extract data from Tesseract's hocr output file. We're limited to Tesseract version 3.04, so no image_to_data function or tsv output is available. I have been able to do it with beautifulsoup and in R, but neither is available in the environment where this needs to be deployed. I just want to extract each word and its confidence (the x_wconf value). An example output file is below, for which I'd be happy to just return the lists [90, 87, 89, 89, 89] and ['The', '(quick)', '[brown]', '{fox}', 'jumps!'].

lxml is the only XML parser available in the environment besides ElementTree, so I'm a bit at a loss for how to proceed.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.05.00dev' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "./testing/eurotext.png"; bbox 0 0 1024 800; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 98 66 918 661">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 98 66 918 661">
     <span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; x_size 39; x_descenders 7; x_ascenders 9"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span> <span class='ocrx_word' id='word_1_4' title='bbox 559 71 663 110; x_wconf 89'>{fox}</span> <span class='ocrx_word' id='word_1_5' title='bbox 687 73 823 113; x_wconf 89'>jumps!</span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

Solution

I figured out a (gross) way to do it using XPath.

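def hocr_to_dataframe(fp):

    from lxml import etree
    import pandas as pd

    # Parse the file at path fp (the variable, not the literal string 'fp')
    doc = etree.parse(fp)
    words = []
    wordConf = []

    for path in doc.xpath('//*'):
        # Word spans have class='ocrx_word'; the confidence sits in the title attribute
        if 'ocrx_word' in path.values():
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            # Assumes x_wconf is the last property in the title, as in the sample
            wordConf.append(int(conf.split('x_wconf ')[1]))
            # itertext() also collects text inside nested tags such as <strong>,
            # where path.text alone would be None
            words.append(''.join(path.itertext()))

    dfReturn = pd.DataFrame({'word' : words,
                             'confidence' : wordConf})

    return dfReturn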
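
Since the question mentions that ElementTree is also available in the environment, the same idea can probably be done with the standard library alone and without pandas. The following is only a sketch under a few assumptions: the function name hocr_words, the regular expression, and the shortened SAMPLE document are made up for illustration; every word span is assumed to carry class='ocrx_word' exactly, with an integer x_wconf in its title, as in the example file above.

```python
import io
import re
import xml.etree.ElementTree as ET

# hOCR output is XHTML, so every element lives in the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"
# Matches the integer that follows "x_wconf" anywhere in a title attribute.
WCONF_RE = re.compile(r"x_wconf\s+(\d+)")


def hocr_words(source):
    """Return (words, confidences) for each ocrx_word span (hypothetical helper).

    `source` may be a file path or a binary file-like object.
    """
    tree = ET.parse(source)
    words, confidences = [], []
    for span in tree.iter(XHTML + "span"):
        if span.get("class") != "ocrx_word":
            continue  # skip ocr_line and any other spans
        match = WCONF_RE.search(span.get("title", ""))
        if match is None:
            continue  # no confidence recorded for this span
        # itertext() includes text nested in child tags such as <strong>.
        words.append("".join(span.itertext()).strip())
        confidences.append(int(match.group(1)))
    return words, confidences


# Shortened stand-in for the example file, used only to demonstrate the call.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113">
<span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span>
</span></body></html>"""

words, confidences = hocr_words(io.BytesIO(SAMPLE))
print(words, confidences)  # ['The', '(quick)'] [90, 87]
```

Filtering on the span tag and class attribute avoids scanning every attribute value of every element, and the regular expression does not depend on x_wconf being the last property in the title.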
