How to fetch information in a structured format with Tesseract OCR in Python?


Question


I am using Ubuntu.

Here is an image that I got from the internet.

I want to extract the data exactly as it is formatted in the image and dump it into a text file. The positions have to be preserved (95-97% accuracy).

I am working with tesseract-ocr.

An almost identical question is here.

My code:

import cv2
import pytesseract
from pytesseract import Output
import numpy as np

img = cv2.imread("/demo.jpg")

d1 = pytesseract.image_to_data(img)

print(d1)

The output is completely different from what I expect.

In short, I want to convert this image (with its alignment) to a text file (or a CSV file).

Solution

You can use Tesseract's hOCR output to retain positional information. Converting this kind of document directly into text while preserving positions is a hard problem. What I can offer is an intermediate solution: a data frame containing each word and its coordinates, which you can then parse to extract key-value information based on those coordinates.

# This saves the Tesseract output as "demo.hocr"
import pytesseract

pytesseract.pytesseract.run_tesseract(
    "demo.jpg", "demo",
    extension='.html', lang='eng', config="hocr")
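To make the contents of that file concrete, here is a minimal sketch that pulls words and bounding boxes out of hOCR markup using only the standard library. It relies on the hOCR convention that each word is a span with class ocrx_word whose title attribute contains "bbox x0 y0 x1 y1". The names HocrWordParser and parse_hocr_words, and the SAMPLE string, are made up for illustration; this is not the tesseract2dict parser, and it assumes word spans do not contain nested spans.

```python
import re
from html.parser import HTMLParser

# hOCR keeps each word's box in the title attribute,
# e.g. title='bbox 10 10 60 30; x_wconf 96'
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word span (illustrative simplification).
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, *self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (text, x0, y0, x1, y1) tuples from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample in the shape of Tesseract's hOCR output (not real OCR data).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 400 100'>
 <span class='ocr_line' title='bbox 10 10 200 30'>
  <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Name</span>
  <span class='ocrx_word' title='bbox 150 10 200 30; x_wconf 91'>John</span>
 </span>
</div>
"""

words = parse_hocr_words(SAMPLE)
```

For a real file, the same function could be fed the contents of demo.hocr, read with open("demo.hocr", encoding="utf-8").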

hOCR is an HTML-like representation that carries a lot of metadata, such as line information, word information, and their coordinates. To make it easier to handle, I wrote a parser that reads it directly and gives you a data frame of words and their coordinates. I have published it on PyPI as a package called tesseract2dict, which you can install with pip install tesseract2dict. This is how you can use it:

import cv2
from tesseract2dict import TessToDict

td = TessToDict()
input_image = cv2.imread('path/to/image.jpg')

# Function 1: get word-level information as a data frame
word_dict = td.tess2dict(input_image, 'outputName', 'outfolder')

# Function 2: get the plain text inside a given region, passed as (x, y, w, h)
height, width = input_image.shape[:2]
text_plain = td.word2text(word_dict, (0, 0, width, height))

PS: This package is only compatible with Tesseract 5.0.0
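Once the words and their coordinates are available, one possible way to get an aligned text file is to group words into lines by their top coordinate and pad each word to a character column derived from its left coordinate. The sketch below illustrates that idea on plain (text, x0, y0, x1, y1) tuples. The function name words_to_aligned_text and the defaults char_width=10 and line_tolerance=10 are assumptions for illustration and would need tuning to the font size and resolution of the actual image.

```python
def words_to_aligned_text(words, char_width=10, line_tolerance=10):
    """Render (text, x0, y0, x1, y1) word boxes as roughly aligned text.

    char_width: assumed pixel width of one character (illustrative default).
    line_tolerance: maximum difference in top y for words to share a line.
    """
    # Group words into lines, comparing each word's top with the line's first word.
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    rendered = []
    for line in lines:
        row = ""
        for text, x0, *_ in sorted(line, key=lambda w: w[1]):
            column = x0 // char_width
            if row:
                # Pad to the target column, keeping at least one space between words.
                row = row.ljust(max(column, len(row) + 1))
            else:
                row = " " * column
            row += text
        rendered.append(row)
    return "\n".join(rendered)


# Hand-made boxes standing in for OCR output (not real data).
sample_words = [
    ("Name", 0, 10, 40, 30),
    ("John", 150, 12, 190, 30),
    ("Age", 0, 50, 30, 70),
    ("42", 150, 51, 170, 70),
]
aligned = words_to_aligned_text(sample_words)
```

The resulting string can be written to a text file as is. For CSV output, the same line grouping could be reused and each line's words written as one row with the csv module.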
