How to extract data from an image that contains tabular data?

Views: 62 Published: 2021/6/12 18:35:46 python opencv ocr tesseract python-tesseract

Problem description

I am using pytesseract, Pillow, and cv2 to OCR an image and get the text it contains. Since my input is a scanned PDF document, I first converted it to an image (JPEG) and then tried extracting the text. I am only halfway there. The input is a table, and the titles are not being recognized because they have a black background. I also tried getStructuringElement but was unable to figure out a way to use it. Here is what I have done so far:

import cv2
import os  
import numpy as np 
import pytesseract
#import pillow 

# A scanned PDF can't be OCR'd directly, so convert it to JPEG with pdf2image first
from pdf2image import convert_from_path
filename = "path"  # placeholder for the scanned PDF's path
pages = convert_from_path(filename, 500)  # render at 500 DPI
for page in pages:
    page.save("dest", 'JPEG')  # "dest" is a placeholder; every page is saved to the same name


imgname = "path" 
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR) 
cv2.imshow("original image", oriimg)
cv2.waitKey(0)


#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC) 
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA) 
# dsize above is given as (width, height)
cv2.imshow("lol", img) 
cv2.waitKey(0) 
cv2.imwrite("changed_dimensionsimgpath", img)


import PIL.Image  
image = cv2.imread(imgname,cv2.IMREAD_COLOR) 
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
grayedimg = cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite("H://newim.jpg", grayedimg)


pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)

My input table looks like the one below. The regions with a black background are not identified by OCR and are not extracted as text. Any help would be greatly appreciated.

Output of this code for the sample image:

Sun by Select .

F'I‘L‘Mlm":[ [Juir SHIIEF'. "ﬁllﬁt Fadll'fi



Brand Type Fragranm Unit: Ithange Dollm 'LChanga Men
Eleanit' Sprayl Grange J.?IEBﬂI-Eﬂ' 11% '5H'1Elﬁ9ﬂﬂﬂ 35% I E
Eleanlt! kﬁmnsul' Grange IEEEESWI 39% I521LESM1MH 1113553 ‘ E
Dehuxe F‘mmr [emu 525.940 461% '51:EE?,GED,00 433.6% 5
Datum: Anus»! ﬁring?) 4,3341%} 29% 513573300119 215% E
Dem Spray ‘Drangr: £432,100 09% 515.223.:53000 154%

Min Blaster Aemgul: Dramge "2114033111 59% :SHSiMMﬂ H94:

DiFlEIESIEf Sprawl Drama "NEW. 50% ‘5E1D1_E-BDM 141% I
Incredlme Spray Lem 1.513.410" 483% a HELENE] $11143 I E

t" In

1'"
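
The flat string printed above loses the row structure of the table. A minimal sketch of one way to keep it, assuming `pytesseract.image_to_data` with `Output.DICT` is used instead of `image_to_string`: each recognized word comes with its block, paragraph, and line numbers and its position, so words can be grouped back into rows. The helper names `words_to_rows` and `ocr_rows` and the `min_conf` cutoff are hypothetical, not part of the original post.

```python
from collections import defaultdict


def words_to_rows(data, min_conf=0):
    """Group OCR words into rows, top-to-bottom and left-to-right.

    `data` is expected to have the layout of the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Non-word entries (blocks, paragraphs, lines) have empty text and conf -1.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["top"][i], word))

    rows = []
    for words in lines.values():
        words.sort()  # order the words of one line by their left coordinate
        top = min(t for _, t, _ in words)
        rows.append((top, [w for _, _, w in words]))
    rows.sort(key=lambda row: row[0])  # order the lines by vertical position
    return [cells for _, cells in rows]


def ocr_rows(image):
    """Run Tesseract on `image` and return its words grouped into rows."""
    import pytesseract  # imported here so words_to_rows works without Tesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_to_rows(data)
```

This only recovers rows; a cell containing several words is still returned as separate words, so splitting rows into columns would need an extra step such as clustering the `left` coordinates.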
