Tesseract OCR: Parsing table cells

Question

I am using Tesseract-OCR v4.0.0 (alpha?) from cmd to extract text from a PNG of the table shown below:

I want Tesseract-OCR to parse everything in one cell before moving on to the next. I do not want it to move on to the next word in the 'line'.

Expected:

...John Smith 07 March, 2017 Chicago Milwaukee Detroit Pacific Ocean...

Actual:

...John Smith 07 March, 2017 Chicago Pacific Ocean Milwaukee Detroit...

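I have tried:

  • Changing the page segmentation mode with the --psm flag, from 0 to 13. The results were mostly the same with minor differences, or unreadable.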

Is there any other way to configure Tesseract to read all the contents of one cell before moving on to the next? If not, are there any workarounds?

Recommended answer

I spent some time on this question because I have seen a few people asking the same thing.

The solution I used is to pre-process the image with OpenCV before passing it to Tesseract, and then rearrange the results. Sorry, my code is quite long, and I think it could be shortened, but it gets the job done. I could not explain the code line by line, but I added comments that I hope give a general idea of what is going on.

import cv2 
import numpy as np
import pytesseract 
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract"

Read the image and filter it

table = cv2.imread("Table.png")
# adding some Border around image 
table= cv2.copyMakeBorder(table,20,20,20,20,cv2.BORDER_CONSTANT,value=[255,255,255])

Remove noise from the image

table_c = cv2.GaussianBlur(cv2.cvtColor(table,cv2.COLOR_BGR2GRAY),(3,3),0,0)
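# Threshold with a fixed cutoff of 200 (to use Otsu instead, OR the flag into
# the type argument: cv2.THRESH_BINARY | cv2.THRESH_OTSU)
_,thre = cv2.threshold(table_c,200,255,cv2.THRESH_BINARY)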

Get only the rows in the image

kernel = cv2.getStructuringElement(cv2.MORPH_RECT,(100,1))
morph = cv2.morphologyEx(thre,cv2.MORPH_CLOSE,kernel)
contours,h = cv2.findContours(morph, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
rows = [None]*len(contours)
for i, c in enumerate(contours):
    rows[i] = cv2.boundingRect(cv2.approxPolyDP(c, 3, True))
rows = sorted(rows, key=lambda b:b[1], reverse=False)
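The reason a wide (100,1) kernel isolates the rows may not be obvious: on a white page with dark ink, a morphological close (dilate, then erode) wipes out any dark feature narrower than the kernel, so text strokes and vertical rules vanish while long horizontal rules survive. Below is a minimal pure-Python sketch of that effect on a single scan line. The helper names `dilate`, `erode`, and `close_line` and the sample pixel values are illustrative assumptions, not OpenCV's implementation.

```python
def dilate(line, k):
    """Each pixel becomes the maximum over a window of width k (k odd)."""
    half = k // 2
    return [max(line[max(0, i - half):i + half + 1]) for i in range(len(line))]

def erode(line, k):
    """Each pixel becomes the minimum over a window of width k (k odd)."""
    half = k // 2
    return [min(line[max(0, i - half):i + half + 1]) for i in range(len(line))]

def close_line(line, k):
    """Morphological close on one scan line: dilation followed by erosion."""
    return erode(dilate(line, k), k)

# Hypothetical scan line: white (255) background, a 3-pixel dark text stroke,
# then a 9-pixel dark table rule.
scan = [255] * 4 + [0] * 3 + [255] * 4 + [0] * 9 + [255] * 4

# With a 5-pixel kernel the 3-pixel stroke is erased; the 9-pixel rule is kept.
closed = close_line(scan, 5)
```

The same reasoning applies to the tall (1,50) kernel used for the columns, with the roles of horizontal and vertical swapped.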

Get only the columns in the image

kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT,(1,50))
morph2 = cv2.morphologyEx(thre,cv2.MORPH_CLOSE,kernel2)
contours,h = cv2.findContours(morph2, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
table = cv2.drawContours(table, contours, 0, (0,255,0), 3)

cols = [None]*len(contours)
for i, c in enumerate(contours):
    cols[i] = cv2.boundingRect(cv2.approxPolyDP(c, 3, True))
cols = sorted(cols, key=lambda b:b[0], reverse=False)

Remove the row and column lines and keep only the text

_,thre2 = cv2.threshold(thre,0,255,cv2.THRESH_BINARY_INV)
no_table = cv2.bitwise_and(morph,thre2)
no_table = cv2.bitwise_and(morph2,no_table)

kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT,(10,2))
mask = cv2.morphologyEx(no_table,cv2.MORPH_CLOSE,kernel2)

Get a bounding box for each piece of text

contours,h = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours_poly = [None]*len(contours)
boundRect = [None]*len(contours)
for i, c in enumerate(contours):
    contours_poly[i] = cv2.approxPolyDP(c, 3, True)
    boundRect[i] = cv2.boundingRect(contours_poly[i])
#     cv2.rectangle(table, (int(boundRect[i][0]), int(boundRect[i][1])),
#                  (int(boundRect[i][0]+boundRect[i][2]), int(boundRect[i][1]+boundRect[i][3])), (0,0,255), 2)
# table = cv2.drawContours(table, contours, -1, (0,255,0), 3)

Crop each box and recognize the text

Get the row and column of each piece of text and its position in the image

text_position = []
offset = 10
boundingBoxes = sorted(boundRect, key=lambda b:b[0], reverse=False)


for rect in boundingBoxes:
    # Skip boxes that are too small to contain text
    if rect[2] > 30 and rect[3]>10:
        # Crop the box with a small margin (clamped at the image edge) and OCR it
        image = table[max(rect[1]-offset,0):rect[1]+rect[3]+offset,max(rect[0]-offset,0):rect[0]+rect[2]+offset]
        text = pytesseract.image_to_string(image)
        # Find the row whose top edge lies just above the top of the box
        r = None
        for i,row in enumerate(rows[:-1]):
            if rect[1] >row[1] and rect[1] <rows[i+1][1]:
                r = i 
                break 
        # Find the column whose left edge lies just left of the box
        c = None
        for i,col in enumerate(cols[:-1]):
            if rect[0] >col[0] and rect[0] <cols[i+1][0]:
                c = i 
                break
                    
        text_position.append({'Text':text.split("\n")[0],"row":r,'col':c,"X":rect[0],"Y":rect[1]})
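The row and column lookup above is a search for the band whose start coordinate is the last one not exceeding a given value. As an alternative sketch, assuming the band starts are already sorted, the standard-library `bisect` module can do the same lookup without nested loops. The helper name `band_index` and the coordinates are hypothetical, and unlike the loop above this version also assigns values on a boundary or past the last start to a band.

```python
from bisect import bisect_right

def band_index(starts, value):
    """Return the index of the last band start <= value, or None if value
    lies before the first band."""
    i = bisect_right(starts, value) - 1
    return i if i >= 0 else None

# Hypothetical Y coordinates of the top edge of each detected row
row_starts = [20, 58, 85, 166]

inside = band_index(row_starts, 86)    # falls in the band starting at 85
before = band_index(row_starts, 10)    # above the first row
last = band_index(row_starts, 200)     # below the last start
```

In the pipeline above, `starts` would be something like `[row[1] for row in rows]` for rows and `[col[0] for col in cols]` for columns.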
        

Combine the text that falls in the same row and column

# Group the recognized fragments by (row, col) so multi-line cells are merged
cells = {}
for t in text_position:
    cells.setdefault((t["row"], t["col"]), []).append(t)

merged = []
for (row, col), parts in cells.items():
    # Order the fragments of a cell from top to bottom, then join their text
    parts = sorted(parts, key=lambda b:b["Y"], reverse=False)
    text = " ".join(p["Text"] for p in parts)
    merged.append({'Text': text, 'row': row, 'col': col, 'X': parts[0]["X"], 'Y': parts[-1]["Y"]})

# Rebuilding the list avoids removing items by index while the indexes shift
text_position = merged

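The final output is a list of dictionaries, each containing the text, row, and column of one cell in the table, along with its pixel position. My run produced the list below. Note that the standalone 'Detroit' entry is a leftover fragment of the merged 'Chicago Milwaukee Detroit' cell at the end.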

[{'Text': 'Jane Doe', 'row': 3, 'col': 1, 'X': 67, 'Y': 167},
 {'Text': 'John Smith', 'row': 2, 'col': 1, 'X': 67, 'Y': 86},
 {'Text': 'Name', 'row': 1, 'col': 1, 'X': 68, 'Y': 59},
 {'Text': '07 March, 2017', 'row': 3, 'col': 2, 'X': 301, 'Y': 167},
 {'Text': '07 March, 2017', 'row': 2, 'col': 2, 'X': 301, 'Y': 86},
 {'Text': ' ', 'row': 1, 'col': 2, 'X': 302, 'Y': 59},
 {'Text': 'Los Angeles', 'row': 3, 'col': 3, 'X': 536, 'Y': 167},
 {'Text': 'Detroit', 'row': 2, 'col': 3, 'X': 536, 'Y': 140},
 {'Text': 'Locations', 'row': 1, 'col': 3, 'X': 536, 'Y': 58},
 {'Text': 'Currently in', 'row': 1, 'col': 4, 'X': 769, 'Y': 58},
 {'Text': 'Pacific Ocean', 'row': 2, 'col': 4, 'X': 770, 'Y': 85},
 {'Text': 'Chicago Milwaukee Detroit ',
  'row': 2,
  'col': 3,
  'X': 535,
  'Y': 140}]
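To get back to the cell-by-cell reading order the question asks for, the list can be sorted by row and then by column before joining the text. The following is a minimal sketch using a few entries copied from the output above (without the X and Y keys); the function name `reading_order` is an assumption for illustration.

```python
def reading_order(cells):
    """Join cell texts row by row, left to right, skipping empty cells."""
    ordered = sorted(cells, key=lambda c: (c['row'], c['col']))
    texts = (c['Text'].strip() for c in ordered)
    return " ".join(t for t in texts if t)

# A subset of the entries shown above
cells = [
    {'Text': 'John Smith', 'row': 2, 'col': 1},
    {'Text': 'Name', 'row': 1, 'col': 1},
    {'Text': '07 March, 2017', 'row': 2, 'col': 2},
    {'Text': 'Pacific Ocean', 'row': 2, 'col': 4},
    {'Text': 'Chicago Milwaukee Detroit ', 'row': 2, 'col': 3},
]

line = reading_order(cells)
print(line)
```

Entries whose row or column could not be determined would need to be filtered out first, since `None` cannot be compared with integers when sorting.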
