Tesseract OCR: Parsing table cells

Views: 163 | Published: 2021/6/12 18:36:08 | Tags: ocr, tesseract


Problem description

I am using Tesseract-OCR v4.0.0 (alpha?) from cmd to extract text from a PNG image of a table.

I want Tesseract-OCR to parse everything in one cell before moving on to the next cell. I do not want it to move on to the next word in the 'line'.

Expected:

... John Smith 07 March, 2017 Chicago Milwaukee Detroit Pacific Ocean ...

Actual:

... John Smith 07 March, 2017 Chicago Pacific Ocean Milwaukee Detroit ...

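I have tried:

Changing the page segmentation mode with the --psm flag, from 0 to 13. The results are usually either the same with minor differences, or unreadable.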

Is there any other way to configure Tesseract to read all the contents of one cell before moving on to the next? If not, are there any workarounds?
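The difference between the two orderings can be made concrete with a small sketch. The word boxes below are hypothetical (the coordinates and column indices are invented for illustration, not measured from the image): sorting by text line reproduces the "actual" order, while sorting by column first reproduces the "expected" order for this table row.

```python
# Hypothetical word boxes for one table row: (text, x, y, column index).
# The coordinates are made up for illustration only.
words = [
    ("John Smith", 67, 86, 0),
    ("07 March, 2017", 301, 86, 1),
    ("Chicago", 536, 86, 2),
    ("Pacific Ocean", 770, 86, 3),
    ("Milwaukee", 536, 113, 2),
    ("Detroit", 536, 140, 2),
]


def line_order(words):
    """Read along each text line: sort by y first, then by x."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]


def cell_order(words):
    """Finish one cell before the next: sort by column, then by y, then by x."""
    return [w[0] for w in sorted(words, key=lambda w: (w[3], w[2], w[1]))]


print(" ".join(line_order(words)))
print(" ".join(cell_order(words)))
```

Line-by-line reading needs only the word positions, whereas cell-by-cell reading also needs to know which cell each word belongs to, so the table structure has to be recovered from the image first.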

Recommended answer

Remove noise from the image

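import cv2
# `table` is assumed to be the BGR image of the table, loaded beforehand (for example with cv2.imread)
table_c = cv2.GaussianBlur(cv2.cvtColor(table, cv2.COLOR_BGR2GRAY), (3, 3), 0, 0)
# Threshold at a fixed level of 200. To let Otsu pick the level instead, pass cv2.THRESH_BINARY | cv2.THRESH_OTSU as the type (the fifth positional argument is `dst`, not a second flag)
_, thre = cv2.threshold(table_c, 200, 255, cv2.THRESH_BINARY)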

Get only the rows in the image

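# Closing with a wide, 1-pixel-high kernel erases dark marks narrower than 100 px
# (text, vertical rules), so only the horizontal lines stay dark
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (100, 1))
morph = cv2.morphologyEx(thre, cv2.MORPH_CLOSE, kernel)
# Two return values as in OpenCV 4.x (OpenCV 3.x returns three)
contours, h = cv2.findContours(morph, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
# Bounding rectangle (x, y, w, h) of each contour, sorted top to bottom
rows = [cv2.boundingRect(cv2.approxPolyDP(c, 3, True)) for c in contours]
rows = sorted(rows, key=lambda b: b[1])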
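To see why a wide, flat kernel isolates the ruling lines, here is a minimal pure-Python sketch of a morphological closing on a single row of pixels. It is an illustration of the idea, not OpenCV's implementation; the helper names `dilate`, `erode`, and `close_1d` and the toy pixel row are assumptions. Dark runs shorter than the kernel disappear, while longer runs survive unchanged.

```python
def dilate(row, k):
    """Each pixel becomes the maximum of its k-wide neighbourhood (window clipped at the edges)."""
    r = k // 2
    return [max(row[max(0, i - r):i + r + 1]) for i in range(len(row))]


def erode(row, k):
    """Each pixel becomes the minimum of its k-wide neighbourhood (window clipped at the edges)."""
    r = k // 2
    return [min(row[max(0, i - r):i + r + 1]) for i in range(len(row))]


def close_1d(row, k):
    """Closing = dilation followed by erosion."""
    return erode(dilate(row, k), k)


# Toy row: white background (255), a 2-pixel dark mark (a letter stroke)
# and a 9-pixel dark run (a piece of a ruling line).
row = [255] * 3 + [0] * 2 + [255] * 4 + [0] * 9 + [255] * 3
closed = close_1d(row, 5)
print(closed)
```

With a kernel width of 5, the 2-pixel mark is wiped out and the 9-pixel run is kept. The (100, 1) and (1, 50) kernels in the answer apply the same idea horizontally and vertically at a larger scale.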

Get only the columns in the image

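# A tall, 1-pixel-wide kernel does the same job vertically: only the vertical lines stay dark
kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
morph2 = cv2.morphologyEx(thre, cv2.MORPH_CLOSE, kernel2)
contours, h = cv2.findContours(morph2, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
# Debug visualisation of the first contour; note that this draws on `table`, the image that is cropped for OCR later
table = cv2.drawContours(table, contours, 0, (0, 255, 0), 3)

# Bounding rectangle (x, y, w, h) of each contour, sorted left to right
cols = [cv2.boundingRect(cv2.approxPolyDP(c, 3, True)) for c in contours]
cols = sorted(cols, key=lambda b: b[0])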

Remove the row and column lines and keep only the text

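# Invert the thresholded image so that ink (text and ruling lines) is white
_, thre2 = cv2.threshold(thre, 0, 255, cv2.THRESH_BINARY_INV)
# The horizontal lines are black in `morph` and the vertical lines are black in `morph2`,
# so AND-ing with both masks removes the lines and leaves only the text
no_table = cv2.bitwise_and(morph, thre2)
no_table = cv2.bitwise_and(morph2, no_table)

# Close small gaps so that the letters of a word merge into a single blob
kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (10, 2))
mask = cv2.morphologyEx(no_table, cv2.MORPH_CLOSE, kernel2)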
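The masking step can be followed on a toy example. The sketch below uses plain nested lists instead of OpenCV images; the 4x5 masks are hand-made stand-ins for `thre2`, `morph`, and `morph2` (an assumption for illustration), and `bitwise_and` here is a local helper, not the OpenCV function.

```python
W, B = 255, 0  # white and black pixel values


def bitwise_and(a, b):
    """Pixel-wise AND of two equally sized masks stored as nested lists."""
    return [[p & q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Stand-in for thre2: white wherever the original image had ink.
ink = [
    [W, W, W, W, W],  # a horizontal ruling line
    [W, B, B, B, B],  # column 0 is a vertical ruling line
    [W, B, W, W, B],  # two text pixels at columns 2 and 3
    [W, B, B, B, B],
]
# Stand-in for morph: the horizontal line is black, everything else white.
no_horizontal = [[B] * 5] + [[W] * 5 for _ in range(3)]
# Stand-in for morph2: the vertical line is black, everything else white.
no_vertical = [[B] + [W] * 4 for _ in range(4)]

text_only = bitwise_and(no_vertical, bitwise_and(no_horizontal, ink))
for row in text_only:
    print(row)
```

Only the two text pixels remain white; both ruling lines are gone because each is black in one of the two line masks.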

Get a bounding box for each piece of text

# Find the contours of the remaining text blobs and take the bounding rectangle (x, y, w, h) of each
contours, h = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours_poly = [None] * len(contours)
boundRect = [None] * len(contours)
for i, c in enumerate(contours):
    contours_poly[i] = cv2.approxPolyDP(c, 3, True)
    boundRect[i] = cv2.boundingRect(contours_poly[i])
# Optional debug drawing of the boxes and contours:
#     cv2.rectangle(table, (int(boundRect[i][0]), int(boundRect[i][1])),
#                  (int(boundRect[i][0]+boundRect[i][2]), int(boundRect[i][1]+boundRect[i][3])), (0,0,255), 2)
# table = cv2.drawContours(table, contours, -1, (0,255,0), 3)

Crop each box and recognize the text

Get the row and column of each piece of text, along with its position in the image

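import pytesseract

text_position = []
offset = 10  # padding, in pixels, added around each cropped box
boundingBoxes = sorted(boundRect, key=lambda b: b[0])

for rect in boundingBoxes:
    # Skip boxes that are too small to contain text
    if rect[2] > 30 and rect[3] > 10:
        # Clamp at 0 so that a box near the border does not produce a negative slice index
        y0, x0 = max(rect[1] - offset, 0), max(rect[0] - offset, 0)
        image = table[y0:rect[1] + rect[3] + offset, x0:rect[0] + rect[2] + offset]
        text = pytesseract.image_to_string(image)
        r = c = None  # stay None if the box falls outside every detected row or column
        # Row index: the box's top edge lies between two consecutive row positions.
        # Stopping at len(rows) - 1 keeps rows[i + 1] in range.
        for i in range(len(rows) - 1):
            if rows[i][1] < rect[1] < rows[i + 1][1]:
                r = i
                break
        # Column index: the box's left edge lies between two consecutive column positions
        for i in range(len(cols) - 1):
            if cols[i][0] < rect[0] < cols[i + 1][0]:
                c = i
                break

        text_position.append({'Text': text.split("\n")[0], "row": r, 'col': c, "X": rect[0], "Y": rect[1]})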
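The row and column lookup is a search for the band between two consecutive boundaries. As a hedged alternative to the linear scan, the sketch below uses the standard `bisect` module on sorted boundary positions. The function name `band_index` and the boundary values are hypothetical, and unlike the loop above it treats a value lying exactly on a boundary as belonging to the band that starts there.

```python
from bisect import bisect_right


def band_index(boundaries, value):
    """Return i such that boundaries[i] <= value < boundaries[i + 1], or None if value is outside.

    `boundaries` must be sorted in ascending order.
    """
    i = bisect_right(boundaries, value) - 1
    if 0 <= i < len(boundaries) - 1:
        return i
    return None


# Hypothetical boundary positions, for illustration only.
row_tops = [50, 80, 160, 240]
col_lefts = [60, 295, 530, 765, 1000]

print(band_index(row_tops, 86))    # y of a text box
print(band_index(col_lefts, 536))  # x of a text box
print(band_index(row_tops, 10))    # above the first boundary
```

Returning `None` for a box outside the table makes the failure visible instead of silently reusing the index found for the previous box.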

Combine the text that falls in the same row and column

# Group the indices of all entries that share the same (row, col)
groups = {}
for i, t in enumerate(text_position):
    groups.setdefault((t["row"], t["col"]), []).append(i)

merged = []
to_remove = []
for idxs in groups.values():
    if len(idxs) > 1:
        # Order the fragments of one cell from top to bottom and join them
        parts = sorted((text_position[i] for i in idxs), key=lambda b: b["Y"])
        text = " ".join(p["Text"] for p in parts)
        merged.append({'Text': text, 'row': parts[0]["row"], 'col': parts[0]["col"], 'X': parts[0]["X"], 'Y': parts[-1]["Y"]})
        to_remove.extend(idxs)

# Delete from the highest index down so that earlier deletions do not shift later ones
for i in sorted(to_remove, reverse=True):
    text_position.pop(i)
text_position.extend(merged)
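The merging step can also be written as a small standalone function, which makes it easy to try on sample data. This is a sketch under assumptions: `merge_cells` is a hypothetical helper, the positions of "Chicago" and "Milwaukee" are invented, and the merged entry keeps the position of the topmost fragment rather than the Y of the last one.

```python
from collections import defaultdict


def merge_cells(fragments):
    """Join all text fragments that share a (row, col) into one entry per cell."""
    cells = defaultdict(list)
    for frag in fragments:
        cells[(frag["row"], frag["col"])].append(frag)

    merged = []
    for (row, col), parts in sorted(cells.items()):
        # Read the fragments of a cell top to bottom, then left to right
        parts.sort(key=lambda f: (f["Y"], f["X"]))
        merged.append({
            "Text": " ".join(p["Text"] for p in parts),
            "row": row,
            "col": col,
            "X": parts[0]["X"],
            "Y": parts[0]["Y"],
        })
    return merged


fragments = [
    {"Text": "Detroit", "row": 2, "col": 3, "X": 536, "Y": 140},
    {"Text": "Pacific Ocean", "row": 2, "col": 4, "X": 770, "Y": 85},
    {"Text": "Chicago", "row": 2, "col": 3, "X": 536, "Y": 86},    # hypothetical position
    {"Text": "Milwaukee", "row": 2, "col": 3, "X": 536, "Y": 113},  # hypothetical position
]
merged = merge_cells(fragments)
for cell in merged:
    print(cell)
```

Building a new list avoids removing items from the list that is being indexed, which is where index shifting can otherwise cause the wrong entries to be dropped.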

The final output is a list of dictionaries, each containing the text, row, and column of a cell in the table, like below:

[{'Text': 'Jane Doe', 'row': 3, 'col': 1, 'X': 67, 'Y': 167},
 {'Text': 'John Smith', 'row': 2, 'col': 1, 'X': 67, 'Y': 86},
 {'Text': 'Name', 'row': 1, 'col': 1, 'X': 68, 'Y': 59},
 {'Text': '07 March, 2017', 'row': 3, 'col': 2, 'X': 301, 'Y': 167},
 {'Text': '07 March, 2017', 'row': 2, 'col': 2, 'X': 301, 'Y': 86},
 {'Text': ' ', 'row': 1, 'col': 2, 'X': 302, 'Y': 59},
 {'Text': 'Los Angeles', 'row': 3, 'col': 3, 'X': 536, 'Y': 167},
 {'Text': 'Detroit', 'row': 2, 'col': 3, 'X': 536, 'Y': 140},
 {'Text': 'Locations', 'row': 1, 'col': 3, 'X': 536, 'Y': 58},
 {'Text': 'Currently in', 'row': 1, 'col': 4, 'X': 769, 'Y': 58},
 {'Text': 'Pacific Ocean', 'row': 2, 'col': 4, 'X': 770, 'Y': 85},
 {'Text': 'Chicago Milwaukee Detroit ',
  'row': 2,
  'col': 3,
  'X': 535,
  'Y': 140}]
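Once every cell has a row and a column, the cell-by-cell reading order asked for in the question is a plain sort. The sketch below is one possible way to do it, not part of the original answer; `read_cell_by_cell` is a hypothetical helper and the sample entries reuse the row 2 texts from the output above.

```python
def read_cell_by_cell(cells):
    """Join cell texts in reading order: by row first, then by column."""
    ordered = sorted(cells, key=lambda c: (c["row"], c["col"]))
    return " ".join(c["Text"].strip() for c in ordered)


cells = [
    {"Text": "Pacific Ocean", "row": 2, "col": 4},
    {"Text": "John Smith", "row": 2, "col": 1},
    {"Text": "Chicago Milwaukee Detroit ", "row": 2, "col": 3},
    {"Text": "07 March, 2017", "row": 2, "col": 2},
]
line = read_cell_by_cell(cells)
print(line)
```

The `strip()` call removes stray whitespace around a cell's text, such as a trailing space left over from concatenation.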
