How to extract data from an image that contains tabular data?


Question

I am using pytesseract, pillow and cv2 to OCR an image and get the text it contains. Since my input is a scanned PDF document, I first convert it to an image (JPEG) and then try to extract the text. I am only halfway there. The input is a table, and the titles are not returned because they have a black background. I also tried getStructuringElement but could not work out how to use it here. This is what I have so far:

import cv2
import os
import numpy as np
import pytesseract
from pdf2image import convert_from_path
#import pillow

# The scanned PDF can't be passed to the OCR step directly, so convert it to JPEG with pdf2image first
filename = "path"
pages = convert_from_path(filename, 500)
for page in pages:
    page.save("dest", 'JPEG')


imgname = "path" 
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR) 
cv2.imshow("original image", oriimg)
cv2.waitKey(0)


#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC) 
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA)
# the target size is given as (width, height)
cv2.imshow("lol", img) 
cv2.waitKey(0) 
cv2.imwrite("changed_dimensionsimgpath", img)


import PIL.Image
image = cv2.imread(imgname,cv2.IMREAD_COLOR)
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
grayedimg = cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite("H://newim.jpg", grayedimg)


pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)

My input table looks like the one below. The regions with a black background are not recognized by the OCR and are not extracted as text. Any help would be greatly appreciated.

Output of this code for the sample image:

Sun by Select .

F'I‘L‘Mlm":[ [Juir SHIIEF'. "fillfit Fadll'fi



Brand Type Fragranm Unit: Ithange Dollm 'LChanga Men
Eleanit' Sprayl Grange J.?IEBflI-Efl' 11% '5H'1Elfi9flflfl 35% I E
Eleanlt! kfimnsul' Grange IEEEESWI 39% I521LESM1MH 1113553 ‘ E
Dehuxe F‘mmr [emu 525.940 461% '51:EE?,GED,00 433.6% 5
Datum: Anus»! firing?) 4,3341%} 29% 513573300119 215% E
Dem Spray ‘Drangr: £432,100 09% 515.223.:53000 154%

Min Blaster Aemgul: Dramge "2114033111 59% :SHSiMMfl H94:

DiFlEIESIEf Sprawl Drama "NEW. 50% ‘5E1D1_E-BDM 141% I
Incredlme Spray Lem 1.513.410" 483% a HELENE] $11143 I E

t" In

1'"

Recommended answer

Using cv2 is fine. After cv2.imwrite(temp_filename, gray_img), run the OCR like this:

from PIL import Image
from pytesseract import image_to_string

# Use config='-psm 6' (Tesseract 4 and later spell this flag '--psm 6')
page_str = image_to_string(Image.open(temp_filename), lang="eng", config='-psm 6')

This should return good data from table images.
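The answer does not deal with the black-background titles directly. One possible way to handle them, which is not something the answer proposes, is to invert the dark rows before running the OCR, so that white-on-black titles become dark text on a light background. The following is a minimal sketch under that assumption. The helper names invert_dark_rows and ocr_table and the threshold of 128 are hypothetical, and the row-wise test only catches dark bands that span most of the page width.

```python
import numpy as np


def invert_dark_rows(gray, dark_threshold=128):
    """Invert every pixel row whose mean brightness is below dark_threshold.

    gray is a 2-D uint8 array (a grayscale page). Mostly dark rows, such as
    a table header with white text on black, are inverted; all other rows
    are returned unchanged. The input array is not modified.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    row_is_dark = gray.mean(axis=1) < dark_threshold
    result = gray.copy()
    result[row_is_dark] = 255 - gray[row_is_dark]
    return result


def ocr_table(image_path, psm=6):
    """Load an image, invert its dark rows, and OCR it with Tesseract."""
    # Imported here so invert_dark_rows can be tried without OpenCV
    # or Tesseract being installed.
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    cleaned = invert_dark_rows(gray)
    # '--psm' is the Tesseract 4+ spelling of the flag; 3.x releases used '-psm'.
    return pytesseract.image_to_string(cleaned, lang="eng", config=f"--psm {psm}")
```

Calling ocr_table("path") would then take the place of the image_to_string call above. If only some cells in a row are dark, the same idea would need to be applied per cell rather than per row.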
