Read text with OCR from an image that has two or three columns of data, using Python


Question

In the example image (for reference only; my images follow the same pattern), some pages have text running across the full width of the page, while others have the text laid out in two columns.

How can I automatically detect the layout of a page and read the columns one after the other in Python?

I am using Tesseract OCR with PSM 6, and it reads horizontally across the columns, which is wrong.

Recommended answer

One way to accomplish this is to use morphological operations and contour detection.

With the former, you essentially "bleed" all the characters into big, chunky blobs. With the latter, you locate these blobs in the image and extract the ones that look interesting (meaning: big enough).
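To make the "bleeding" idea concrete, here is a toy one-dimensional sketch in pure Python. It is not how OpenCV implements morphological closing, and the helper names (`dilate_1d`, `erode_1d`, `close_1d`, `count_blobs`) and the sample row are made up for illustration. It treats one row of pixels as a list of 0s and 1s and shows that closing fills gaps narrower than the kernel, so a small kernel joins characters while a kernel that is too wide also joins the columns.

```python
def dilate_1d(row, radius):
    """Each pixel becomes the maximum of its neighbourhood (grows white areas)."""
    n = len(row)
    return [max(row[max(i - radius, 0):min(i + radius + 1, n)]) for i in range(n)]


def erode_1d(row, radius):
    """Each pixel becomes the minimum of its neighbourhood (shrinks white areas)."""
    n = len(row)
    return [min(row[max(i - radius, 0):min(i + radius + 1, n)]) for i in range(n)]


def close_1d(row, width):
    """Closing = dilation followed by erosion with the same (odd) kernel width."""
    radius = width // 2
    return erode_1d(dilate_1d(row, radius), radius)


def count_blobs(row):
    """Count runs of consecutive 1s."""
    return sum(1 for i, v in enumerate(row) if v and (i == 0 or not row[i - 1]))


# Hypothetical pixel row: two "words" of two characters each,
# separated by a wide gap that stands in for the space between columns.
row = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1]

print(count_blobs(row))               # 4: every character is its own blob
print(count_blobs(close_1d(row, 3)))  # 2: characters merged, columns still separate
print(count_blobs(close_1d(row, 7)))  # 1: kernel too wide, columns merged as well
```

The same trade-off applies in two dimensions: the kernel has to be wider than the gaps between characters but narrower than the gutter between columns.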

The script used:

import cv2
import sys

SCALE = 4
AREA_THRESHOLD = 427505.0 / 2

def show_scaled(name, img):
    # shape[:2] is (height, width) for both grayscale and color images
    h, w = img.shape[:2]
    cv2.imshow(name, cv2.resize(img, (w // SCALE, h // SCALE)))

def main():
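    img = cv2.imread(sys.argv[1])
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        sys.exit("Could not read image: " + sys.argv[1])
    img = img[10:-10, 10:-10] # remove the border, it confuses contour detection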
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    show_scaled("original", gray)

    # black and white, and inverted, because
    # white pixels are treated as objects in
    # contour detection
    thresholded = cv2.adaptiveThreshold(
                gray, 255,
                cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV,
                25,
                15
            )
    show_scaled('thresholded', thresholded)
    # I use a kernel that is wide enough to connect characters
    # but not text blocks, and tall enough to connect lines.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (13, 33))
    closing = cv2.morphologyEx(thresholded, cv2.MORPH_CLOSE, kernel)

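    # findContours returns (image, contours, hierarchy) in OpenCV 3 but
    # (contours, hierarchy) in OpenCV 4; [-2] picks the contours in both.
    contours = cv2.findContours(closing, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]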
    show_scaled("closing", closing)

    for contour in contours:
        convex_contour = cv2.convexHull(contour)
        area = cv2.contourArea(convex_contour)
        if area > AREA_THRESHOLD:
            cv2.drawContours(img, [convex_contour], -1, (255,0,0), 3)

    show_scaled("contours", img)
    cv2.imwrite("/tmp/contours.png", img)
    cv2.waitKey()

if __name__ == '__main__':
    main()

Then all you need to do is compute the bounding box of each contour and crop it from the original image. Add a bit of margin and feed the result to Tesseract.
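That last step could be sketched roughly as follows. This is not part of the original answer: `pad_box`, `reading_order`, `ocr_regions`, the 50-pixel `column_tolerance` and the 10-pixel `margin` are all assumptions chosen for illustration, and the column grouping is a simple heuristic (boxes whose left edges are close together count as one column). The sketch assumes `contours` holds only the contours that passed the area filter, that `img` is a clean copy of the image they were detected on (without the drawn outlines), and that the `pytesseract` package is installed.

```python
def pad_box(box, margin, img_w, img_h):
    """Grow an (x, y, w, h) box by `margin` pixels, clamped to the image.

    Returns corner coordinates (x0, y0, x1, y1) ready for slicing.
    """
    x, y, w, h = box
    x0 = max(x - margin, 0)
    y0 = max(y - margin, 0)
    x1 = min(x + w + margin, img_w)
    y1 = min(y + h + margin, img_h)
    return x0, y0, x1, y1


def reading_order(boxes, column_tolerance=50):
    """Order boxes column by column (left to right), top to bottom inside a column.

    Heuristic (an assumption): a box joins the current column when its left
    edge is within `column_tolerance` pixels of that column's first box.
    """
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        if columns and box[0] - columns[-1][0][0] <= column_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered


def ocr_regions(img, contours, margin=10):
    """Crop each contour's bounding box from `img` and run Tesseract on it."""
    # Imported here so the geometry helpers above work without these packages.
    import cv2
    import pytesseract

    img_h, img_w = img.shape[:2]
    boxes = [cv2.boundingRect(c) for c in contours]  # each is (x, y, w, h)
    texts = []
    for box in reading_order(boxes):
        x0, y0, x1, y1 = pad_box(box, margin, img_w, img_h)
        crop = cv2.cvtColor(img[y0:y1, x0:x1], cv2.COLOR_BGR2RGB)
        # Each crop is a single block of text, so PSM 6 is a reasonable guess here.
        texts.append(pytesseract.image_to_string(crop, config="--psm 6"))
    return texts
```

Under these assumptions, a page that yields a single large box is effectively a one-column page, and a page that yields two or three boxes side by side is read one column after the other.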
